If you have used MySQL for some time you know that mysqld can write binlogs. These are usually used for backup purposes and point-in-time recovery (PITR), or for replication, so that a slave can collect the changes made on the master and apply them locally.
Most of the time, apart from configuring how long you keep them, the binlogs are pretty much ignored.
Recently I came across an issue. I have a slave server which is NOT configured read-only and which has an additional database used to collect statistics from the replicated database and provide aggregation and business information: the typical sales per country, per product, per day, week, month, year, per whatever. This is the usual data-warehouse type functionality. It’s done on a slave rather than the master so that the load is not passed downstream to all boxes.
So this slave is quite busy, and in fact is bigger than the other normal slaves which are used for other things. The data-mining processes are pretty heavy and, as you would expect, create a lot of I/O on the server. Sometimes I noticed that the daily batch process would take a lot longer than usual, sometimes significantly longer. I monitor the server with various tools, including Cacti and MySQL Enterprise Monitor (Merlin), but for this particular issue, which happened from time to time, the tools I was using could not identify why the batch jobs took so much longer. This was very frustrating. Managers often get upset if their reporting tools do not provide up-to-date information, so you can imagine…
After quite a lot of head-scratching it eventually seemed like a good idea to look at the binlogs. These files are not in a very readable format, but using the mysqlbinlog binary you can turn them into something which is readable. This is true if you use statement-based replication, which I do (the reasons I do not yet use row-based replication deserve a post of their own). So a bit of Perl later I was able to analyze the binlogs and see, on a day-to-day basis, the number of statements stored in them and the operations (INSERT, UPDATE, DELETE) which were applied to each table. A bit more manipulation lets you count the rows affected by each statement.
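My script is in Perl, but the core idea is simple enough to sketch in a few lines of Python. This is an illustrative assumption, not my production code: the regexes below only handle simple, unqualified table names, and they are applied to the text that mysqlbinlog prints for statement-based binlogs.

```python
import re
from collections import Counter

# Statement types we care about in mysqlbinlog's SBR text output.
# Simplified: ignores qualified names like db.table and multi-table statements.
_PATTERNS = [
    ("I", re.compile(r"^\s*INSERT\s+(?:IGNORE\s+)?INTO\s+`?(\w+)`?", re.I)),
    ("U", re.compile(r"^\s*UPDATE\s+`?(\w+)`?", re.I)),
    ("D", re.compile(r"^\s*DELETE\s+FROM\s+`?(\w+)`?", re.I)),
]

def summarize(lines):
    """Count statements per (operation, table) from mysqlbinlog text output."""
    counts = Counter()
    for line in lines:
        for op, pat in _PATTERNS:
            m = pat.match(line)
            if m:
                counts[(op, m.group(1))] += 1
                break
    return counts
```

In practice you would feed it the decoded binlog, something along the lines of `mysqlbinlog binlog.000123 | python summarize.py`.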
So I looked at the results for a few different days and, hey presto, the slowdown on the analytical slaves coincided exactly with the days when the slave was processing more changes via replication. The number of statements had gone up by more than a factor of three, due to a new script which had recently been added. Neither the master nor the other slaves were really affected by this extra load, but on the analytical slaves it generated a lot more I/O. This was enough to trace the cause, and via the replicated tables which were affected it was possible to find out which process was heavier than expected and address the problem.
Since coming across this problem, and because there is no other real way to see the “replication” related load on many slaves (at least using Merlin), I have now put in place some scripts which push these table-activity statistics into a dba database, so I can look at the numbers and see how they change over time. If I have a similar problem I can look at the statistics and hopefully identify the source more quickly and thus resolve it sooner. This also keeps managers happy, which is never a bad thing.
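As an illustration of the storage side, here is a minimal sketch using SQLite as a stand-in for a real MySQL dba schema. The table name binlog_table_stats and its columns are my invention for this example, not necessarily what you would want in production:

```python
import sqlite3

# Hypothetical schema mirroring the per-table counters discussed above;
# in the real setup this would live in a MySQL "dba" database.
DDL = """CREATE TABLE IF NOT EXISTS binlog_table_stats (
    stat_date     TEXT    NOT NULL,
    table_name    TEXT    NOT NULL,
    op            CHAR(1) NOT NULL,   -- I / U / D
    stmt_count    INTEGER NOT NULL,
    rows_affected INTEGER NOT NULL,
    bytes         INTEGER NOT NULL,
    PRIMARY KEY (stat_date, table_name, op)
)"""

def store_stats(conn, stat_date, rows):
    """rows: iterable of (table_name, op, stmt_count, rows_affected, bytes)."""
    conn.execute(DDL)
    conn.executemany(
        "INSERT OR REPLACE INTO binlog_table_stats VALUES (?,?,?,?,?,?)",
        [(stat_date, t, op, c, r, b) for (t, op, c, r, b) in rows],
    )
    conn.commit()
```

With one row per (day, table, operation) it is trivial to chart the load over time or compare today against last week.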
Processing binlogs takes some time, and mine are configured as 1 GB files. Storing this summary information is much more compact, so I can keep a much longer history than would otherwise be possible.
So if you have not considered doing this before and want to “measure” your replication load and perhaps see how it changes over time then consider analyzing your binlogs, storing the results in a database, and providing some pretty tables or graphs which you can use to show colleagues or managers the cause behind unexpected changes in batch processing times…
Update 24/10/2010 11:00 CEST
Since writing this article I have also come across this web page, which does something similar to what I am doing. It does not go into the detail I am aiming for, but it does collect some extra statistics which perhaps I should incorporate into my scripts. Certainly the events, bytes and transactions per second counters (maximum and average) are worth adding, especially if you record data over time.
It was also pointed out to me to look at Maatkit’s mk-query-digest, but despite reading through the different options I do not see a combination which gives per-table or per-database statistics (number of statements, average and maximum size of each one, number of rows affected). It is a command with a lot of options, but I don’t think it covers my needs.
Since perhaps it was not clear exactly what I am collecting, the table below gives a brief summary taken from one of the master servers (table names have been adjusted).
 count  values   columns      bytes  op  table
269877       0   1096500  841309698  U   MainTable
   456  675786  45152749  474400297  I   MainTable_new
144764  321457   5811580  345366449  I   Statistics1_201010
172069  172104   7575605  320017961  I   Statistics2_201010
247208       0   9662207  282552711  U   Statistics1_201010
144765  324509   3963200  243302911  I   Statistics3_201010
617756  617756   2471024  179151900  I   SomethingDemandAverage
144765  148219   5452203  142284127  I   Statistics4_201010
 17978  525758   3136621  129989738  I   BookingNotes
 16811  516676   7750189  107097598  I   SomeTypeOfMessages
count – number of statements related to this table
values – for INSERT-type statements, the number of VALUES ( … ), ( … ), … tuples, basically the rows inserted
columns – for INSERT and UPDATE statements, the number of columns affected (multiplied by the number of rows in the case of an INSERT)
bytes – total size in bytes of the statements for each table
op – I = INSERT statements, U = UPDATE statements, ….
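To make the counters concrete, here is a sketch in Python of how the values, columns and bytes figures can be derived for a single-line INSERT. It is deliberately naive, assuming no parentheses or commas inside string literals and counting columns only when an explicit column list is present:

```python
import re

def insert_metrics(stmt):
    """Rough per-statement counters for a simple single-line INSERT:
    rows inserted (VALUES tuples), columns touched, and statement bytes."""
    # Explicit column list, e.g. INSERT INTO t (a,b) VALUES ...
    cols_m = re.search(r"INSERT\s+INTO\s+\S+\s*\(([^)]*)\)", stmt, re.I)
    ncols = len(cols_m.group(1).split(",")) if cols_m else 0
    # Each "( ... )" tuple after VALUES is one inserted row.
    tuples = re.findall(r"\(([^)]*)\)", stmt.split("VALUES", 1)[1]) if "VALUES" in stmt else []
    nrows = len(tuples)
    return {
        "values": nrows,           # rows inserted
        "columns": ncols * nrows,  # columns x rows, as in the table above
        "bytes": len(stmt.encode()),
    }
```

So a two-row INSERT into two columns contributes 2 to values, 4 to columns, and its full statement length to bytes.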
Ordering by bytes or count, or grouping by table name, allows you to focus on exactly where the bulk of the processing takes place.
While I have been focusing only on SBR, you can reasonably easily do similar processing and get similar statistics from RBR binlogs.
Missing from my current stats are the exec_time values (MySQL should record the exact time taken to execute a statement, not just the whole number of seconds; I should really file a feature request for this if it’s not in 5.5) and correctly assigning each table to the appropriate database. That’s something that will happen shortly in my code.