Archive for October, 2010

Bacula 5.0.3 binary rpms for CentOS 5 on http://postfix.wl0.org/ftp/bacula/

Friday, October 29th, 2010

I’ve recently updated my CentOS 5 x86_64 rpms for bacula-5.0.3.

Hope they are useful to you.

How analysing your binlogs can be quite informative

Saturday, October 23rd, 2010

If you have used MySQL for some time you know that mysqld can write binlogs. This is usually used for backup purposes and JITR or for replication purposes so a slave can collect the changes made on the master and apply them locally.

Most of the time apart from configuring how long you keep these binlogs they are pretty much ignored.

Recently I came across an issue.  I have a slave server which is NOT configured read only and which has an additional database used to collect statistics from the replicated database and provided aggregation and business information. The typical sales per country, per product, per day, week, month, year, per whatever etc.  This is the usual datawarehouse type functionality.  It’s done on a slave and not the master so that the load is not passed on down stream to all boxes.

So this slave is quite busy and in fact is bigger than other normal slaves which are used for other things. The data mining processes are pretty heavy and create a lot of I/O on the server as would be expected.  Some times I noticed that the daily batch process would take a lot longer than usual, sometimes significantly longer. I monitor the server with various tools including cacti and MySQL Enterprise Monitor (Merlin), but for this particular issue which would happen from time to time the tools I was using could not identify why the batch jobs took so much longer.  This was very frustrating. Managers often get upset if their reporting tools do not provide up to date information so you can imagine…

After quite a lot of head-scratching eventually it seemed like a good idea to look at the binlogs. These files are not in a very readable format but using the mysqlbinlog binary you can get the output into something which is readable. This is true if you use statement based replication which I do (the reasons I do not yet use row-based replication needs a post another day). So a bit of perl later and I was able to analyze the binlogs and see on a day to day basis the number of statements that were stored in the binlogs and also the operations (INSERT, UPDATE, DELETE) which were applied on these tables.  A bit more manipulation enables you to count the rows affected for each statement.

So I looked at the results for a few different days and hey presto. The cause of the slowdown on the analytical slaves co-incided exactly with days were the slave was processing more changes via replication. The number of statements had gone up by more than a factor of three, due to a new script which had been recently added.  Neither the master or other slaves were really affected by this extra load but the analytical slaves generated a lot more extra I/O and basically this change was enough to trace the cause and via the replicated tables which were affected it was possible to find out which process was being heavier than expected and address the problem.

Since coming across this problem and because there is no real other way to see the “replication” related load on many slaves (at least using Merlin) I’ve now put in place some scripts which push this table activity statistics into a dba database so I can look at these number and see how they change over time. If I have a similar problem I can look at the statistics and hopefully identify the source of the problem more quickly and thus resolve the problem. This also keeps managers happy which is never a bad thing.

Processing binlogs takes some time, and mine are configured as 1GB files. Storing this summary information is much more compact and thus I can keep a much longer history than would otherwise be possible.

So if you have not considered doing this before and want to “measure” your replication load and perhaps see how it changes over time then consider analyzing your binlogs, storing the results in a database, and providing some pretty tables or graphs which you can use to show colleagues or managers the cause behind unexpected changes in batch processing times…

Update 24/10/2010 11:00 CEST

Since writing this article I also came across this web page which does something similar to what I am doing. It does not go into the detail I am trying to do but does collect some extra statistics which perhaps I should incorporate into my scripts.  Certainly the events, bytes and transactions per second counters (max and average) are worthwhile adding, especially if you record data over time.

It was also pointed out to me to look at Maatkit’s mk-query-digest but despite reading the different options I do not see a combination of options which gives per table or per database statistics (number of statements, average and maximum size of each one, number of rows affected). This is a command with a lot of options but I don’t think it covers my needs.

Since perhaps it was not clear actually what I am collecting this table below gives you a quick brief summary taken from one of the master servers (table names have been adjusted).


count     values     columns        bytes    op   table
269877         0     1096500    841309698    U    MainTable
456    675786    45152749    474400297    I    MainTable_new
144764    321457     5811580    345366449    I    Statistics1_201010
172069    172104     7575605    320017961    I    Statistics2_201010
247208         0     9662207    282552711    U    Statistics1_201010
144765    324509     3963200    243302911    I    Statistics3_201010
617756    617756     2471024    179151900    I    SomethingDemandAverage
144765    148219     5452203    142284127    I    Statistics4_201010
17978    525758     3136621    129989738    I    BookingNotes
16811    516676     7750189    107097598    I    SomeTypeOfMessages

columns:
——-
count – number of statements related to this table
values – for INSERT type statements the number of VALUES ( ..) , ( … ) … values, basically rows inserted
columns – for INSERT and UPDATE statements the number of columns affected ( multiplied by the number of rows in the case of an INSERT )
bytes – total sum of bytes in each statement for each table.
op – I = Insert Statements, U = Update Statements, ….

Ordering by bytes or count, or grouping by table name allows you to focus on exactly which where the chunk of processing takes place.

While I have only been focusing on SBR replication you can reasonably easily do similar processing and get similar statistics based on RBR.

Missing from my current stats are using the exec time values (MySQL should record the exact number of seconds taken to do an execution, not the whole number of seconds. I should really file a feature request for this if it’s not in 5.5.) and correctly assigning the table to the appropriate database. That’s something that will happen shortly in my code.

Vigor2820n 3.3.4 firmware upgrade also breaks SIP registrations

Saturday, October 16th, 2010

A few days after writing my last post about problems after upgrading my ADSL router’s firmware I also noticed that my VoIP connections were not working properly. I have several SIP providers and after a day or so of using the new firmware the SIP registrations to my providers started failing.  Initially I thought this was caused by my ISP as not all registrations seemed to be affected. I have not changed my Asterisk configuration in some time and did not associate the problem with the change in router firmware as everything else seemed to work fine.

The asterisk logging showed:

[Oct 15 02:12:04] NOTICE[3329] chan_sip.c:    -- Registration for '....@xxxxxxxxxx.com' timed out, trying again (Attempt #5)
[Oct 15 02:12:24] NOTICE[3329] chan_sip.c:    -- Registration for '....@xxxxxxxxxx.com' timed out, trying again (Attempt #6)
[Oct 15 02:12:44] NOTICE[3329] chan_sip.c:    -- Registration for '....@xxxxxxxxxx.com' timed out, trying again (Attempt #7)

A router reboot fixes the problem but it does come back again. Rebooting a router on a daily basis is not something I really want to do or think should be necessary.

I’ve reported the problem to Draytek support so will see what they say. In the meantime I see they’ve posted a 3.3.4.1 version of the firmware so perhaps this is one of a few known problems. Let’s see if this latest version solves my problem.

Update 25/10/2010

In spite of the upgrade to 3.3.4.1 I still notice the same problem:


[Oct 25 07:56:43] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6898)
[Oct 25 07:56:56] NOTICE[19553] chan_sip.c: -- Registration for '123456789@voip-provider1.com' timed out, trying again (Attempt #5151)
[Oct 25 07:57:03] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6899)
[Oct 25 07:57:16] NOTICE[19553] chan_sip.c: -- Registration for '123456789@voip-provider1.com' timed out, trying again (Attempt #5152)
[Oct 25 07:57:23] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6900)
[Oct 25 07:57:36] NOTICE[19553] chan_sip.c: -- Registration for '123456789@voip-provider1.com' timed out, trying again (Attempt #5153)
[Oct 25 07:57:43] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6901)
[Oct 25 07:57:56] NOTICE[19553] chan_sip.c: -- Registration for '123456789@voip-provider1.com' timed out, trying again (Attempt #5154)
[Oct 25 07:58:03] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6902)
[Oct 25 07:58:16] NOTICE[19553] chan_sip.c: -- Registration for '123456789@voip-provider1.com' timed out, trying again (Attempt #5155)
[Oct 25 07:58:23] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6903)
[Oct 25 07:58:36] NOTICE[19553] chan_sip.c: -- Registration for '123456789@voip-provider1.com' timed out, trying again (Attempt #5156)
[Oct 25 07:58:43] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6904)
[Oct 25 07:58:56] NOTICE[19553] chan_sip.c: -- Registration for '123456789@voip-provider1.com' timed out, trying again (Attempt #5157)
[Oct 25 07:59:03] NOTICE[19553] chan_sip.c: -- Registration for 'myaccount@voip-provider2.com' timed out, trying again (Attempt #6905)

So looks like I’m going to have to revert back to 3.3.3.