Thoughts about extlookup() in puppet

May 11th, 2011

To manage some of the database servers I look after I use puppet.

I’ve recently been trying to clean up quite an extensive set of puppet recipes I’d written. A suggestion that has been made was to remove the data from the recipes and a pointer was made to the new extlookup functionality. That is keep data and code separate and make it easier to modify the behaviour without having to modify the code itself.

In my case I manage a few sets of database servers. Similar groups of boxes are configured the same way. One specific function I was looking at to improve was this snippet.

case $domain {
'': { s_db::master_vif::check { 'master vif': interface => 'eth0:123', ip_address => '' } }
'': { s_db::master_vif::check { 'master vif': interface => 'eth0:456', ip_address => '' } }

The intention here being that on the primary master the interface would be enabled and on the standby master the interface would NOT be enabled.

Note: this snippet applies to a pair of boxes but there are several other pairs to configure in the same way, each with their own specific settings.

The extlookup() functionality only allows a single “lookup value” to be searched and implicitly rather than explicitly where to look for this value. The snippet above has 2 “parameters” and if I have several pairs of boxes: MDB_A, MDB_B, MDB_C (the name is not part of the domain or the hostname) then using the extlookup() functionality I’m going to have to setup a lot of data files and this looks unwieldy, especially as I have other resources defined which take several parameters.

So I thought about how I would do this to make the lookup facility more generic and visible.

What seems better to me would be something which is a bit more generic and based on looking up a < “config_item”, “lookup_value” > pair in a single location to get a single value as now. That would make the extlookup() prototype probably change to something like:

extlookup( “config_item”, [ list of variables to use ] , “default_value”, [ lookup "tables/csv files" ] )

parameters being:
[1] config_item ( grouping of values to lookup )
[2] array of values to apply the lookup on, done in order as now but more explicitly shown
[3] default value
[4] explict array of locations of the data files.

The .csv files would have 3 columns: config_item,lookup_value,result_value

Using something like this I can then simplify the snippet above to:

$interface_name = extlookup( 'MDB_A-vif-interface', [ $hostname, $domain ], '', [ 'dba_config' ] )
$ip_address = extlookup( 'MDB_A-vif-ip-address', [ $hostname, $domain ], '', [ 'dba_config' ] )

s_db::master_vif::check { 'master vif': interface => $interface_name, ip_address => $ip_address } }

This is clearer as the configuration is more explicit then the current extlookup() definition.

If performance of having a large number of entries in a single file were a concern then talking to a DB backend and doing something like the following would work:

SELECT lookup_value FROM some_table WHERE config_item = ? AND lookup_key = ?

Also while in this example above I can use $hostname or $domain as a retrieval key in several other boxes I probably can’t but might need to lookup on another value such as server_type or db_name.
Having the explicit [ .... ] list where I can add any string I want to be looked up give more flexibility and is clearer.

Using the current (puppet 2.6) mechanism would require me, if I understand it correctly, to configure different
.csv files to be searched for each “configuration parameter”. I may also need in several places to override the $extlookup_precedence. In any case this can’t be overridden several times within the same module which is what I would need if I want to lookup different variables.

That doesn’t strike me as being very useful. It’s also unclear from the current extlookup() call itself where the data is being retrieved from. Using these external variables seems to me to make the calls behave like magic.

So my question was to ask how others who are using the extlookup() functionality how they cope with more complex settings than the system settings which depend on say $hostname or $domain etc. and whether my proposed extension makes any sense.

Thanks for your thoughts.

2011-05-12, update

Another snippet of something I’d like to do which seems easier to fulful using my proposed extlookup() setup.

$notifications_1 = ''
$notifications_2 = ''
$notifications_3 = ''

case $hostname {
'host01':  { $email = $notifications_1 }
'host02': { $email = $notifications_2 }
default:   { $email = $notifications_3 }

This would seem more naturally expressed as:

$email = extlookup( 'notification_email', [ $hostname ], $notifications_3 )

or perhaps the following if you want to only check for boxes in your domain.

$email = extlookup( 'notification_email', [ $hostname, $domain ] )

but most importantly with a single data file containing:

How to start mysqld using numactl

March 15th, 2011

Various people have complained about Linux swapping unexpectedly on boxes running mysqld, when apparently mysqld was not using all the memory and there was quite a bit of free memory available.

There’s also an article by Jeremy Cole. However, his solution requires a one-line change to mysqld_safe which while it’s small does not work very well if you have to maintain a large number of servers and are using packages asa package upgrade will overwrite the modified file mysqld_safe and then restart mysqld with the unmodified script. This leads to the need to repatch the script and then restart mysqld. Not very helpful.

So I was looking for another solution and came up with this option which basically requires a minor change to /etc/my.cnf and the use of a small shell wrapper script. The change to my.cnf is simply to add the following to the [mysqld_safe] section:


... other options you might be using ...

# wrapper around mysqld

The wrapper script is little more than:

# work around to startup mysqld using numactl

numactl=/usr/bin/numactl        # Adjust the path as needed
mysqld=/usr/sbin/mysqld         # Adjust the path as needed

# use exec to avoid having an extra shell around.
exec $numactl --interleave all $mysqld "$@"

This minor change means that you can still use the standard mysql scripts and they will work even after an upgrade to a new rpm or other type of package. So if you’ve seen your mysql server swapping when you think you have the memory settings correctly adjusted and have been trying to figure out how to easily adjust the configuration to try out numactl to see if this improves things this seems to be an easier way of doing it. – script for monitoring MySQL instances

February 24th, 2011

I thought I would upload a small script I use which I find quite useful. is normally run minutely from cron and collects statistics from any MySQL instance it finds. Data is kept for up to a week and then rotated. This allows you to check what happened over a period of time, for example if an instance crashes or seems to have a lot of load.

I have found it quite invaluable on the instances I’m monitoring and once setup it’s usually nice and silent. If you find this useful or have suggestions for improvements please let me know.

A few more things to add to my desired SOHO FreeSWITCH configuration

February 11th, 2011

Since my last post I’ve spoken some more on #freeswitch and also freeswitch-users and have some ideas of how to implement some of the things I wanted to do on my SOHO PBX setup.

Additionally I’ve also added several other tasks which I think are important. So here is where you can find the updated post:

I’ve not fully understood the configuration in FreeSWITCH using 2 different ports for internal and external traffic, especially as most SIP systems just use the default SIP port 5060. As such I want to configure FreeSWITCH so it works on a single port
rather than using the 2 ports the default configuration uses.

Since the requirements list has become quite long I want to try to document how I get to the final setup in the hope it will allow someone else to do this more quickly than it’s taken me.

If you see a configuration / description of FreeSWITCH behaviour which is wrong, or a better way to implement these features please let me know. This is a learning process for me.

Freeswitch – progress and frustration

February 2nd, 2011

I’ve not posted for a while and thought it about time I gave an update on my progress trying to get FreeSWITCH to work.


I previously mentioned that I was trying out Bluebox and it looked promising. I did play around with it for a while but had various issues which I had trouble solving.  One of these was due to the way the configuration is stored. Bluebox configuration is stored in a database and then auto-generated. This means that the configuration files have entries for things like sip_interface_1, trunk_1, trunk_2, etc… When looking in the log files this can be a bit confusing as it’s not apparent which “real entry these names refer to”.

2011-01-09 22:40:33.837612 [NOTICE] sofia_reg.c:342 Registering trunk_1
2011-01-09 22:40:33.837612 [ERR] sofia_reg.c:1612 trunk_1 Registration Failed with status DNS Error [503]. failure #37
2011-01-11 12:49:56.927418 [DEBUG] switch_core_state_machine.c:77 sofia/sipinterface_1/ Standard ROUTING
2011-01-11 12:49:56.935221 [INFO] mod_dialplan_xml.c:331 Processing 1001 <1001>->003491xxxxxxx in context context_2

I’d suggest that the configuration optionally allows the user to provide their own names so that you can entries such as the_name_of_my_sip_provider instead.

There were also some configuration issues with registrations and dialing not working as I had expected.  To be honest I did get offers of help from some of the Bluebox developers, but did not think it right to expect them to hand-hold me debugging certain software issues. In the end I could not get Bluebox to work the way I expected and so gave up. That is not to say that Bluebox is bad, but I think it needs to include some more tools to help diagnose the status of various things.  After all if you provide a GUI interface it helps if you also provide some sort of GUI test tools.  Perhaps I will try again later and see if  some of the issues I had experienced have been resolved.

Back to Manual Configuration File editing

In the end I decided to try again and edit the configuration files manually. I used the Bluebox install iso as a base (running inside vmware) and this time decided before I started to try to write up a requirements document and work through it. That can be found here. I decided to run the configuration inside a vmware guest while I was testing as this keeps the configuration behind my router’s NAT setup and in theory FreeSWITCH is not visible to the outside world so is safe while I learn and adjust the configuration.

The results of doing things this way have been promising but I notice several things which perhaps for a FreeSWITCH newbie may not be ideal.

  • I believe that the default FreeSWITCH logging may be too verbose. This is great for developers who understand all the logging messages and may be looking for a specific log entry but can be rather intimidating for someone who is trying to see if something worked or not and if not why not. For those type of people a less verbose log level may be appropriate, and of course you can later increase logging if you need to get the details of why something is not working as expected. So I’m not sure that verbose logging is ideal for the default freeswitch configuration.
  • xml file breakage. I’ve occasionally edited the xml configuration files incorrectly and when running reloadxml get the frustrating message about a failure to parse the file and some weird line number. Figuring out what entry is wrong and why from this message is quite tricky.  I only recently came across some comment saying that the preparsed xml files get written to log/freeswitch.xml.fsxml.  This helps a lot as you see the final configuration file generated from the various individual files but I believe mention of this should be more prominently commented in the various tutorials. It’s easy to miss. That said FreeSWITCH could at least report the full path of the xml filename where the error is found.
  • Related to the same issue is the fact that, as far as I can see, comments need to be in the format <!– some stuff –>. It is not clear to me whether white space around these comments is allowed always or only sometimes as the “xml breakage” I seem to have triggered was often caused by adding several lines of comments as above, often with white space before or after the comments. I did not expect that to make a difference but it seems that in some places it does.  I’d love some clarification on this and if it’s possible to make FreeSWITCH’s xml parser slightly more tolerant that would be nice.

Default configuration FreeSWITCH security concerns

Note: the comments here relate to the 1.0.6 version of FreeSWITCH. I’ve tried to build newer versions on my Linux box but the build fails. Checking out 1.0.6 builds without problems but I realise that a lot of changes have happened since. It would be nice to test on a newer version.
The default/sample configuration seems to be very open which is great for learning as a lot of functionality is enabled by default. However, this seems to be a perfect opportunity for SPIT (which seems to be the term used for the equivalent of VoIP SPAM).  While the default password for the extensions is configurable in vars.xml it is not apparent that it may be really important that you change this to something really secure.  I was working with FreeSWITCH behind a NAT’d router so was pretty relaxed about this, thinking they couldn’t reach the software, as it was not accessible on my public IP address.  However, that assumption seems not to be as true as I thought. It seems that registering with an external SIP gateway opens a hole out, but also opens a hole through the firewall to FreeSWITCH. That’s scary and I have not seen it mentioned anywhere.

A few days ago I noticed a lot of FreeSWITCH logging of something apparently trying to dial various international numbers. Looking at the logs it seems that someone had managed to guess one of my extension’s (not a default extension) passwords and was trying to dial out again to various numbers including places like Somalia.  As my configuration was not complete, and the default gateway not setup, this was failing but I was rather horrified at how easily someone could get in.  I was lucky and this could have cost me a lot of money.

I still don’t understand how the attacker found/figured out the extension that was being used to dial out from. Not being a default extension it seems it must have been guessing a large number and finally found one that worked.

However, this really reminded me of a similar situation several years back with sendmail. Before qmail, postfix and several other newer mail servers were popular most UNIX boxes came pre-installed with sendmail. Sendmail is great: you can configure it to do anything but the configuration file is rather complex and hard to really understand. At the time the configuration also assumed that it was fine to resend mail and often sendmail was configured as an open relay.  Newer MTAs (and sendmail too now) are configured with a slightly different attitude: the Internet is hostile and there are lots of people who, if they can, may try to abuse your system.  So now the software is configured to be very restrictive of what mail it will accept and forward.

How does this relate to FreeSWITCH? I’ve seen attacks to my current Asterisk setup (though not successfully) and with the attack of my FreeSWITCH configuration it seems that VoIP software should have something similar. The configuration should be designed to allow the least amount of access necessary and also to have controls in place to mitigate attack attempts.  Things that come to mind are:

  • the external profile should log authentication failures by default
  • FreeSWITCH should have some sort of rate limiting configured, so that someone trying to access FreeSWITCH frequently will trigger this and be ignored for a while, with the issue logged clearly.
  • It should be more obvious how to configure network ACLs for extensions, and these should be configured by default.  Postfix by default allows relaying (registering an extension) only from hosts in my networks which by default is the network of the network interface. This can of course be adjusted but is a good safe starting point.  FreeSWITCH’s default configuration should perhaps be similar, only allow registrations from clients on my network.
  • It seems that to register the client can use FreeSWITCH’s ip address or domain name.  Once someone outside has figured out where the SIP listener is located it’s quite easy to try to register with different users at the ip address. If you only accept the registration to then the attacker has to know what the correct configuration realm is before they can make much progress. That’s much harder for them. So it again seems like a good idea to NOT allow registrations directly to the IP address (at least by default).
  • For trunk connections if you have a DID number you expect the VoIP provider to call you. In this case it seems good that the example configuration clearly allow you to limit where the incoming calls to that incoming trunk number may come from, and whether they should be password protected or not.
  • Allow rate limiting of calls to a gateway, or from an extension. That’s different to rate limiting registration attempts to a specific extension, or from a specific ip address.  Also allow the configuration and limiting of the number of concurrent connections that may be allowed through a trunk/gateway.

I believe that some and perhaps  all of these features are possible in FreeSWITCH but they are not documented explicitly in the default configuration in a way which you really take note. It would be helpful if that were addressed.  You expect to have a mail server on all unix servers, and they should be secure. Many people using voice software probably expect the same.

Additionally a hacked mail server does not cost you money. A hacked PBX does.

FreeSWITCH cookbook

One thing that I would like to see is a FreeSWITCH cookbook, something that gives examples of lots of different types of configuration and is complete enough to use to solve different tasks that many people using FreeSWITCH may need to get
resolved. Like many other cookbooks I’ve seen (for perl, mysql) comments on the how and why help understand things better.  This could be part of the existing FreeSWITCH book, though perhaps a separate book would be more appropriate.

The FreeSWITCH configuration files are complex and there are many of them. I think this is part of what makes the software hard to follow for someone who is not using it on a day to day basis. Compare that with Asterisk which basically
has 2 configuration files you are going to change: sip.conf and extensions.conf. FreeSWITCH has many more.  This is why I think that you see a lot of people playing with Asterisk: it’s configuration is reasonably straight forward to understand and doing the basics is quite simple. From there you can hack away adding things that you want to improve. So a FreeSWITCH cookbook would be great. It could mention many of the things that I’ve mentioned in this post but others such some of ones I mention here below are issues I’m still working on resolving:

  • how to setup dialing to a trunk with a fallback trunk if the call fails.
  • common voicemail setups
    • set different languages for different extensions
    • set different languages for an outside caller depending on how he dials FreeSWITCH (which incoming DID number is used)
    • setup of a common voicemail mailbox (typical in a SOHO environment) shared by various extensions
  • how to setup FreeSWITCH in a new language (what’s required to make this work). FreeSWITCH seems to come with English and Russian and there are some external non-free language sound files, but from what I can see no more free ones. Can we “hack” or use a set of Asterisk sound files (using as many of them that “match”) in the meantime as Asterisk has sound support in many more languages? A short guide of what is wanted, or perhaps some sort of fundraising to bulid these would be nice.  I’d be willing to donate something if this were coordinated by the FreeSWITCH developers. If I could figure out what is needed then I may do this myself.  My specific interest is in Spanish, but it would be nice to have some “British English” sound files. Not strictly necessary of course but would be nice.
  • How to change the dial tones to sound like different countries. I live in Spain but have never really figured out how to change the dial tones to match what my family members expect to hear. Though I think that the default setup seems to match pretty closely, it would be nice to match this properly.  Again specific examples for different countries is useful for everyone.
  • Setting up the equivalent of Asterisk’s macros. I’ve seen in several places that I’d like to do the same thing and can’t figure out how to do what I’ve done in Asterisk which is setup a macro to parameterise the functionality. Something as simple as:
    • dial_gateway(gateway_name,gateway_name_file,number_to_call) which makes my configuration
      • say “using gateway <gateway_name>” (taken from some wmv files)
      • set up recording of the call
      • dial the number via the specified gateway
    • for calls to an extension(extension_number,language)
      • ring for a while and if picked up, setup recording and answer
      • if not picked up redirect to voicemail (in the appropriate language)

These are just some examples of things which it seems need to be grouped together as common functionality and so far I have not figured out how to do this in FreeSWITCH short of duplicating the configuration for each extension or trunk.

Summing up

In the meantime I am going to try and figure out how to tighten down my configuration so that it feels safe. Having seen various
attack attempts and one that partly succeeded I really want to be able to trust the configuration to not be the cause of a large bill
at some time in the future.  I like FreeSWITCH and would not have persevered with it if I did not think that the software is good.
It just seems to me that the documentation, while available, is not there in a form which many of us can digest easily.  The FreeSWITCH 1.0.6 book has been very helpful but I believe needs to cover more detail than now perhaps in a second edition. I think that more and better documentation will encourage more people to use this software. Sometimes I wonder if I am the only one who finds FreeSWITCH tricky to configure.

My Review of MySQL High Availability

November 24th, 2010

Originally submitted at O’Reilly

Server bottlenecks and failures are a fact of life in any database deployment, but they don’t have to bring everything to a halt. MySQL provides several features that can protect you from outages, whether you’re running directly on the hardware, on virtual machines, or in the cloud. This…

Good coverage of MySQL and replication

By sjmudd from Madrid, Spain on 11/24/2010
4out of 5

Pros: Easy to understand, Accurate, Well-written, Concise, Helpful examples

Best Uses: Intermediate, Expert

Describe Yourself: Sys Admin, Database Administrator

I work in quite a large site and found the book full of useful tips of things to do, and ways to do them in order to achieve the title’s goal of MySQL High Availability.

To be honest the book should really be called “MySQL and replication in large environments” but that does not sound so exciting.

There’s less information on performance tuning and more on working with a large replicated setup. That’s fine: it fits my needs and will certainly fit others too.

It’s also interesting to note how the authors solve many problems by writing wrapper routines around MySQL and the required replication procedures. That’s done in python which is fine but most of the routines I have written with shell scripts or perl. Again if you are looking at the technique the implementation language does not really matter. Oracle should really take note and come up with some standard routines to make more of these tasks doable the same way by everyone. That makes all DBA or sysadmins lives’ easier.

So the book was full of good tips and procedures and is certainly worth the read.


BlueBox GUI for FreeSWITCH looks very promising

November 11th, 2010

(NOTE: If you’ve come here from my apologies. I’ve notified them to only follow my database related posts and hope they’ll not follow my full blog feed shortly.)

I recently came across a new site which offers a GUI configuration tool for FreeSWITCH.

Despite buying the FreeSWITCH 1.0.6 Book which is a very good read and playing a bit with the config files I’ve not found hand editing the native xml configuration files that intuitive.  This is probably because I don’t have enough spare time now to look at these things or participate in the freeswitch mailing list.

Anyway, I found BlueBox which seems to solve this nicely. The project seems quite new but as far as I can see the interface and usability is pretty good. This is basically similar to something like trixbox for Asterisk. Bluebox make things easy by providing a downloadable custom install image based on CentOS 5 and that gives you an empty PBX. While things are not yet quite working for me it does seem quite easy to add extensions (devices), trunks (gateways for calling outside) and then configure behaviour of voicemail, and for example whether you record incoming or outgoing calls. So this is nice, certainly nicer than doing this all by hand. However, it’s not quite working yet so I need to do further investigation to see why. Probably a mistake my end, a misunderstanding of the required input data or similar.

This project looks quite new and I have seen a couple of issues so far:

  • I attempted to install bluebox on vmware. Thinking this is just a PBX I only gave the vmware guest 512 MB of RAM. This was too little as I see that FreePBX uses 235 MB, Apache nearly 330 MB and MySQL (which stores the configuration) 240 MB. For a SOHO setup like mine that seems a lot.  Configuring the guest with 512MB does make the GUI a bit sluggish which is not helpful so I hope that bluebox will at least state the requirements more clearly and if possible figure out how to reduce the memory footprint (my current Asterisk setup uses 500MB with 80MB resident but the OS has plenty of memory spare).
  • Perhaps not surprisingly many PBX systems like this provide a default set of routes which look normal when used in your own country. The problem is the world is not as simple as this, or consistent with it’s numbering, so if you live somewhere else all these helpful initial routes are useless and you need to build your own. I’ve put in a ticket which is intended to address this for bluebox and give optional routes if based in Spain where I live.  Hopefully, the bluebox people will accept this and people who live in other countries can supply equivalent routes, thus making the initial configuration simpler for newbies like me. See BLUEBOX-221.
  • It’s possible to define more than one route to the same destination via different trunks. Currently it doesn’t seem possible to prioritise which of these should take preference though you can define that if a route fails to try another one. I’d like to see a way to prioritise the order in this case as one route may be less reliable but cheaper and so a good first choice option to try.

I’m sure these issues will get addressed one way or another and must admit to a very good first impression of Bluebox. I’m hoping to clear up my current mis-configurations shortly and thus be able to test behaviour and see how things go. Hopefully this will also allow me to look at the generated configuration files and see better how to configure FreeSWITCH in the future.
I’ll let you know how I get on.

Updated: 2010-11-13

Other thoughts which come to mind are:

  • The bad thing about GUIs is that debugging problems means the person reporting the problem needs to describe the problem very well, or better still provide screen shots of what is going on. Bluebox is going to have the same issue. I’d suggest that if possible some debug option is possible which will at least save the configuration and logging (excluding password information) so that people who are trying to help can see what is going on.  This certainly helped the MySQL support people who had to help me debugging several Merlin problems.
  • I believe the bluebox configuration is completely stored in the database. That’s good as it potentially allows more flexibility. One thing I notice from the current configuration is the configuration generation uses names like location_1, sipinterface_1, trunk_1, etc.  These names are logical from the point of view of the configuration generator, but not from the point of view of someone looking at their own code. So I’d suggest that for many of the configuration options like Trunks, SIP Interfaces, Locations etc that it’s possible to provide an “identifier type name” which if provided would substitute the less readable default tags. This perhaps requires a bit of work but would help someone trying to understand their configuration files.
  • The URL to adjust your location bluebox/index.php/locationmanager/edit/n provides fields for “Location name” and “Domain Name”. It might be clearer to and label this Domain Name/Realm which is what is shown in bluebox/index.php/locationmanager/index. See BLUBOX-224.

Bacula 5.0.3 binary rpms for CentOS 5 on

October 29th, 2010

I’ve recently updated my CentOS 5 x86_64 rpms for bacula-5.0.3.

Hope they are useful to you.

How analysing your binlogs can be quite informative

October 23rd, 2010

If you have used MySQL for some time you know that mysqld can write binlogs. This is usually used for backup purposes and JITR or for replication purposes so a slave can collect the changes made on the master and apply them locally.

Most of the time apart from configuring how long you keep these binlogs they are pretty much ignored.

Recently I came across an issue.  I have a slave server which is NOT configured read only and which has an additional database used to collect statistics from the replicated database and provided aggregation and business information. The typical sales per country, per product, per day, week, month, year, per whatever etc.  This is the usual datawarehouse type functionality.  It’s done on a slave and not the master so that the load is not passed on down stream to all boxes.

So this slave is quite busy and in fact is bigger than other normal slaves which are used for other things. The data mining processes are pretty heavy and create a lot of I/O on the server as would be expected.  Some times I noticed that the daily batch process would take a lot longer than usual, sometimes significantly longer. I monitor the server with various tools including cacti and MySQL Enterprise Monitor (Merlin), but for this particular issue which would happen from time to time the tools I was using could not identify why the batch jobs took so much longer.  This was very frustrating. Managers often get upset if their reporting tools do not provide up to date information so you can imagine…

After quite a lot of head-scratching eventually it seemed like a good idea to look at the binlogs. These files are not in a very readable format but using the mysqlbinlog binary you can get the output into something which is readable. This is true if you use statement based replication which I do (the reasons I do not yet use row-based replication needs a post another day). So a bit of perl later and I was able to analyze the binlogs and see on a day to day basis the number of statements that were stored in the binlogs and also the operations (INSERT, UPDATE, DELETE) which were applied on these tables.  A bit more manipulation enables you to count the rows affected for each statement.

So I looked at the results for a few different days and hey presto. The cause of the slowdown on the analytical slaves co-incided exactly with days were the slave was processing more changes via replication. The number of statements had gone up by more than a factor of three, due to a new script which had been recently added.  Neither the master or other slaves were really affected by this extra load but the analytical slaves generated a lot more extra I/O and basically this change was enough to trace the cause and via the replicated tables which were affected it was possible to find out which process was being heavier than expected and address the problem.

Since coming across this problem and because there is no real other way to see the “replication” related load on many slaves (at least using Merlin) I’ve now put in place some scripts which push this table activity statistics into a dba database so I can look at these number and see how they change over time. If I have a similar problem I can look at the statistics and hopefully identify the source of the problem more quickly and thus resolve the problem. This also keeps managers happy which is never a bad thing.

Processing binlogs takes some time, and mine are configured as 1GB files. Storing this summary information is much more compact and thus I can keep a much longer history than would otherwise be possible.

So if you have not considered doing this before and want to “measure” your replication load and perhaps see how it changes over time then consider analyzing your binlogs, storing the results in a database, and providing some pretty tables or graphs which you can use to show colleagues or managers the cause behind unexpected changes in batch processing times…

Update 24/10/2010 11:00 CEST

Since writing this article I also came across this web page which does something similar to what I am doing. It does not go into the detail I am trying to do but does collect some extra statistics which perhaps I should incorporate into my scripts.  Certainly the events, bytes and transactions per second counters (max and average) are worthwhile adding, especially if you record data over time.

It was also pointed out to me to look at Maatkit’s mk-query-digest but despite reading the different options I do not see a combination of options which gives per table or per database statistics (number of statements, average and maximum size of each one, number of rows affected). This is a command with a lot of options but I don’t think it covers my needs.

Since perhaps it was not clear actually what I am collecting this table below gives you a quick brief summary taken from one of the master servers (table names have been adjusted).

count     values     columns        bytes    op   table
269877         0     1096500    841309698    U    MainTable
456    675786    45152749    474400297    I    MainTable_new
144764    321457     5811580    345366449    I    Statistics1_201010
172069    172104     7575605    320017961    I    Statistics2_201010
247208         0     9662207    282552711    U    Statistics1_201010
144765    324509     3963200    243302911    I    Statistics3_201010
617756    617756     2471024    179151900    I    SomethingDemandAverage
144765    148219     5452203    142284127    I    Statistics4_201010
17978    525758     3136621    129989738    I    BookingNotes
16811    516676     7750189    107097598    I    SomeTypeOfMessages

count – number of statements related to this table
values – for INSERT type statements the number of VALUES ( ..) , ( … ) … values, basically rows inserted
columns – for INSERT and UPDATE statements the number of columns affected ( multiplied by the number of rows in the case of an INSERT )
bytes – total sum of bytes in each statement for each table.
op – I = Insert Statements, U = Update Statements, ….

Ordering by bytes or count, or grouping by table name allows you to focus on exactly which where the chunk of processing takes place.

While I have only been focusing on SBR replication you can reasonably easily do similar processing and get similar statistics based on RBR.

Missing from my current stats are using the exec time values (MySQL should record the exact number of seconds taken to do an execution, not the whole number of seconds. I should really file a feature request for this if it’s not in 5.5.) and correctly assigning the table to the appropriate database. That’s something that will happen shortly in my code.

Vigor2820n 3.3.4 firmware upgrade also breaks SIP registrations

October 16th, 2010

A few days after writing my last post about problems after upgrading my ADSL router’s firmware I also noticed that my VoIP connections were not working properly. I have several SIP providers and after a day or so of using the new firmware the SIP registrations to my providers started failing.  Initially I thought this was caused by my ISP as not all registrations seemed to be affected. I have not changed my Asterisk configuration in some time and did not associate the problem with the change in router firmware as everything else seemed to work fine.

The asterisk logging showed:

[Oct 15 02:12:04] NOTICE[3329] chan_sip.c:    -- Registration for '' timed out, trying again (Attempt #5)
[Oct 15 02:12:24] NOTICE[3329] chan_sip.c:    -- Registration for '' timed out, trying again (Attempt #6)
[Oct 15 02:12:44] NOTICE[3329] chan_sip.c:    -- Registration for '' timed out, trying again (Attempt #7)

A router reboot fixes the problem but it does come back again. Rebooting a router on a daily basis is not something I really want to do or think should be necessary.

I’ve reported the problem to Draytek support so will see what they say. In the meantime I see they’ve posted a version of the firmware so perhaps this is one of a few known problems. Let’s see if this latest version solves my problem.

Update 25/10/2010

In spite of the upgrade to I still notice the same problem:

[Oct 25 07:56:43] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6898)
[Oct 25 07:56:56] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5151)
[Oct 25 07:57:03] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6899)
[Oct 25 07:57:16] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5152)
[Oct 25 07:57:23] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6900)
[Oct 25 07:57:36] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5153)
[Oct 25 07:57:43] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6901)
[Oct 25 07:57:56] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5154)
[Oct 25 07:58:03] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6902)
[Oct 25 07:58:16] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5155)
[Oct 25 07:58:23] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6903)
[Oct 25 07:58:36] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5156)
[Oct 25 07:58:43] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6904)
[Oct 25 07:58:56] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #5157)
[Oct 25 07:59:03] NOTICE[19553] chan_sip.c: -- Registration for '' timed out, trying again (Attempt #6905)

So looks like I’m going to have to revert back to 3.3.3.

Vigor2820n 3.3.4 firmware upgrade stops you accessing the “external ip” from inside

September 30th, 2010

I recently upgraded the firmware on my Draytek Vigor 2820n ADSL router to version 3.3.4 from 3.3.3. One thing that surprised me was that the change stopped me being able to access my public IP from behind the router. That is I have an internal LAN with RFC1918 addresses such as 192.168/16 and could access my public ip address which gets routed back by the router to an internal host. The 3.3.4 firmware upgrade stops that working. That makes it a bit of a pain to test connectivity to sites like my web site from inside compared to from outside and it took me a while to figure out what had happened. In the end I’ve had to implement split DNS so that internal references to the “external sites” I provide resolve to internal ip addresses internally but the public see my external ip address. A lot of work for a home network but otherwise things just don’t work properly.

So if you have a Vigor 2820n then be aware of the change in behaviour from the previous version of the firmware.

Found a nasty COALESCE() related bug in 5.5.6-rc

September 30th, 2010

Seeing as it looks like 5.5 is shortly about to go GA I thought I’d give it a run and see how well it works. The only way really to test it is to give it a bit of load and look for things which break. That I did with the 5.5.6-rc community rpms, compared to the 5.1 advanced rpms I usually run.

My colleagues, Ben and Peter, found a horrible problem which means that I can’t use this even for any real usage on my real servers. See: bug#57095 for all the gory details. Thanks to them both for finding the problem and then digging down and figuring out the real cause. Sometimes developers work a long way from the database so their errors don’t translate into something I can really look at in the database. They delved into the problem and then found the cause and a nice easy test case which I could report to MySQL.

After a couple of days of running this was the only thing that turned up. It may not catch many people but it might catch you.

I hope this gets resolved quickly so I can try and find the next bug. 5.5 does look interesting and I’m looking forward to it becoming GA. Once that happens I have to decide the upgrade plan again.  That is a surprisingly long procedure to run from start to finish. I do not want to chase MySQL and it’s newest release but performance changes in 5.5 make it very interesting. I remember when I tested 5.4 last year it was clear that 5.4 was much better than 5.1 performancewise, but it was not GA. Had I jumped on the 5.4 bandwagon to get these extra features or speed I would probably have paid for that. So it does pay to be patient.

Getting temporal configuration values into date-range value equivalents

September 7th, 2010

I collect a lot of configuration values from my database servers and most of these values are stored by date.

So often I end up with values such as:

config_date config_value
2010-09-01  value_1
2010-09-02  value_1
2010-09-03  value_2
2010-09-04  value_3
2010-09-05  value_3
2010-09-06  value_3
2010-09-07  value_3
2010-09-08  value_4
2010-09-09  value_4
2010-09-10  value_1
2010-09-11  value_5
2010-09-12  value_5
2010-09-13  value_5
I´ve been unsuccessfully been trying to figure out how to convert this in SQL into something like the following:

config_from config_to   config_value
2010-09-01  2010-09-02  value_1
2010-09-03  2010-09-03  value_2
2010-09-04  2010-09-07  value_3
2010-09-08  2010-09-09  value_4
2010-09-10  2010-09-10  value_1
2010-09-11  2010-09-13  value_5

The second format is often much shorter if configuration changes are not that frequent. While this is straight forward to do in a programming language in a single pass once you have the config_date ordered data, I can’t figure out how to do this with SQL and no programming. Celko´s books haven’t enlightened me either.

So do you know how to do this?

Careful how you monitor MySQL

September 7th, 2010

I was recently struck by a problem which is unusual. In order to keep an eye on the database server I use nagios, cacti, merlin and some local scripts to monitor the database instance and ensure that it is working properly.  That normally works fine.  The different monitor processes do various things, one of which is to monitor the replication status of a slave, and warn me if the replication is not working or if it’s behind. This is done with the command SHOW SLAVE STATUS.

The server I was looking at runs some large local batch jobs aggregating data. Unfortunately, I was experiencing that replication was interfering with these batch jobs so decided to see if things would perform better if I stopped replication and let the batch jobs complete, restarting replication afterwards.

So the command STOP SLAVE was sent to the server, and this took some time. In the meantime SHOW SLAVE STATUS hangs. The STOP SLAVE command waits for the SQL replication thread to finish it’s task but that thread was waiting on the tables being used by the batch process, which was already running. The consequence of this was that STOP SLAVE waited, all calls to SHOW SLAVE STATUS blocked and with nagios, cacti, merlin and the local scripts all doing periodic SHOW SLAVE STATUS commands which would hang I ended up with the mysqld running out of user connections.  Nasty!

While I see if MySQL can do something about the behaviour of stopping the slave and showing the slave status I’m going to implement grants which limit the monitor users so they are only allowed to have a user concurrent connections open.  This is done using the syntax

GRANT whatever ON whereever TO some_user WITH MAX_USER_CONNECTIONS 5;

5 looks like a good number for nagios, merlin and the local scripts and 10 is probably sufficient as I collect a lot of graph data for the mysql server and the cacti requests are often done in parallel.

So if you monitor MySQL it may be worth you applying some sort of connection limit on your monitoring scripts, if not on other database users.

XS4ALL offer IPv6 connectivity to retail customers

September 5th, 2010

Good news. I was told by a colleague that the Dutch ISP XS4ALL is offering IPv6 connectivity to its retail customers. You can see here although the comments are in Dutch.  They also provide a list of ADSL routers which should work for their service. The Cisco name may not be surprising but this is good publicity for Draytek and AVM FRITZ!box for their products. Hopefully it will also stimulate other SOHO router providers into the act to get their names on the list.  Let us hope that more ISPs start to offer this sort of service to their customers.

I’m not sure if there’s a list of residential ISPs in each country which provide IPv6 connectivity. If not it might be worth making one and updating it as new providers off this service.

I checked my ISP, Jazztel, and was not really surprised to see that neither technical support or the sales staff really new what I was talking about when I asked if they were planning on offering IPv6 support. That’s unfortunate, but I don’t think any other Spanish ISP is any better.  That is any other residential ISP in Spain. I expect some of the larger bigger ISPs are likely to off this to business customers.

Asterisk attack

August 31st, 2010

There was a lot of talk about this being the next menace after email spam. I’m not actually sure what it’s called for VoIP systems, but my Asterisk setup has started to be attacked over the last few days. Lots of entries like:

[Aug 27 19:20:30] NOTICE[18826] chan_sip.c: Registration from '"742"<sip:742@a.b.c.d>' failed for '' - No matching peer found
[Aug 31 10:13:10] NOTICE[18826] chan_sip.c: Registration from '"1002" <sip:1002@a.b.c.d>' failed for '' - Wrong password

Lots of messages get logged a second and I noticed this as suddenly CPU load on my PC jumped up quite a bit.

For the moment I’ve routed these addresses via the interface lo0 so they won’t bother me any more, but I need to come up with a better solution.

First I’m curious if applications like Asterisk or FreeSwitch have any built-in anti-abuse controls to recognise bad behaviour and to disable those abusers. I’m pretty sure that I’ve not read about anything for Asterisk, and I’m currently reading the FreeSWITCH book I bought but haven’t come across this mentioned yet.  Seems that applications like this may need to have these controls added at some time, just as sendmail, postfix and most mail servers have had to adjust to a hostile world.

The other option of course is to use a firewall or packet filter to limit the incoming traffic rate from a single IP to port 5060 or whereever the SIP connection is being accepted so that when going over the limit the ip will be blocked for some time. iptables can do this I think so I’m going to have to read about how to configure and set that up.

There are other applications designed to watch logs and use them to automatically add temporary blocks. fail2ban is one of these. I’ll also have to see if I can configure it for this task.

So if this has happened to you how do you protect your VoIP systems from that hostile world of the Internet?

Don’t forget to monitor your nameservers

August 24th, 2010

As I mentioned in a past article I got my IPv6 connectivity working so started working on setting up various IPv6 services. One of these was to setup my name server so it also worked on IPv6.

This worked fine, but recently I lost my IPv6 connectivity but thought no more about it. I’m trying to get it working again but that’s not happened yet.

Over the last few days I’ve had some intermittent issues logging in to my home pc from my laptop and I couldn’t figure out why. Eventually I had a look at the DNS requests and of course some requests were being made to my non-reachable name server. This generated some network delays while the resolver timed out and then tried talking to one of the IPv4 name servers and correctly retrieving the information.

So conclusion, if you setup yourself with a name server for your domain and include access on IPv6 then ensure that you monitor it. I do run nagios at home and intend to add some extra checks so that I can see if there are problems. Even when playing with IPv6 if you don’t have things setup properly it can affect non-IPv6 activity…

Once the unreachable IPv6 AAAA entry was removed from my name server things went back to normal.

Using LVM snapshot filesystems for development database instances

August 22nd, 2010

The Problem

Developers often need to have a development database copy of the live production system you are using in able to allow them to test their code and to test new functionality and make schema changes to the database for this new functionality to work.

That’s normal and happens everywhere. A typical DBA task is to make a copy of the live system, sometimes to remove any confidential or sensitive information which perhaps the development database users should not be able to see, and then give them access to this development instance. The developers then “hack away”, changing their code and perhaps things in the database until they are ready to put these new changes into production when they then come along and discuss how to apply these changes into the live systems.

Once the development database has been created it soon becomes stale so often the developers want a new up to date copy to be made to “simplify” their testing.

This is all fine until the database size begins to grow and this process of dumping and copying the data for the developers takes hours rather than minutes and therefore can only be done on a daily or weekly basis.

I have recently been experimenting with the use of mounting the development database instance on an  LVM snapshot of the original filesystem where the live system is running.  This procedure does not have to be Linux specific but should work with any OS or storage which provides a facility to make a filesystem snapshot based on the contents of another filesystem.

So what does this mean in practice?

Previous Behaviour:

Situation: server1 has a live production copy of the database. We want to make a copy to server2. server2 is already prepared with a configuration which will work based on the copy of the production data.

Procedure: stop server1, copy the filesystem holding the database to server2, start server1. start server2. [the copy procedure takes hours.]

New Behaviour:

Situation: server1 has a live production copy of the database (probably via a slave), and space/memory for a second development instance to run concurrently on the same server.

Procedure: stop server1 (live instance), make a LVM snapshot of the live filesystem (the snapshot size can be much smaller than the live filesystem size), start server1 (live instance), start server1 (development instance). [the copy procedure takes just a few seconds.]

Since I do this with the live system being a slave, I tend to also include a routine to disable replication information on the snapshot filesystem by removing the appropriate files. It may also be necessary adjust the grants on the dev-instance so that it is appropriate for the new set of db users.

To all intents and purposes when you login to the development instance it looks like an up to date copy of live system. You can make as many changes as you like as long as the number of disk blocks on the snapshot which get changed don’t exceed the snapshot size. At this point the snapshot filesystem becomes invalid and mysqld is unable to access it. Mysqld gets a bit upset about this, but you just kill it and then go and rebuild the instance again if this happens – it only takes a few seconds.

This works pretty well and speeds things up for the developers. I can create a new development environment from the live system in seconds rather than hours. The disk storage requirements also tend to drop significantly. It also helps the devs. If you do a daily refresh of this development instance then it allows the developers to test any schema changes which will be needed to be applied to the live system much more easily as “going back to the current live state” is so easy.

So if you haven’t done something like this it might be worth giving it a go.

This is an example of the output from a script I’m currently using:

[root@myhost ~]# clone_instance -s40G -d live-instance dev-instance
Aug 22 10:57:00 myhost clone_instance[16405] Cloning live-instance to dev-instance with a snapshot volume of size 40G
Aug 22 10:57:00 myhost clone_instance[16405] /mysql/live-instance is mounted as expected
Aug 22 10:57:00 myhost clone_instance[16405] Device /dev/volgroup1/live-instance is mounted on /mysql/live-instance, having volume group: volgroup1, logical volume: live-instance
Aug 22 10:57:00 myhost clone_instance[16405] live-instance is defined in /etc/my.cnf [mysqld1]
Aug 22 10:57:00 myhost clone_instance[16405] dev-instance is defined in /etc/my.cnf [mysqld2]
Aug 22 10:57:00 myhost clone_instance[16405] Found defaults file /root/.my-live.cnf needed to shutdown live-instance
Aug 22 10:57:00 myhost clone_instance[16405] Going to viciously kill any processes using files under mount point: /mysql/dev-instance
Aug 22 10:57:02 myhost clone_instance[16405] Unmounting /mysql/dev-instance
Aug 22 10:57:02 myhost clone_instance[16405] Removing existing SNAPSHOT LV /dev/volgroup1/dev-instance
Logical volume "dev-instance" successfully removed
Aug 22 10:57:03 myhost clone_instance[16405] SNAPSHOT LV /dev/volgroup1/dev-instance removed
Aug 22 10:57:03 myhost clone_instance[16405] Shutting down live-instance [mysqld1] using mysqldmin and defaults file /root/.my-live.cnf (as 'mysqld_multi stop 1' does not seem work properly)
Aug 22 11:02:34 myhost clone_instance[16405] Creating new snapshot LV dev-instance (40G) based on /dev/volgroup1/live-instance
Logical volume "dev-instance" created
Aug 22 11:02:35 myhost clone_instance[16405] Restarting live-instance [mysqld1] using mysqld_multi start 1
Aug 22 11:02:35 myhost clone_instance[16405] Mounting SNAPSHOT LV /dev/volgroup1/dev-instance on /mysql/dev-instance
Aug 22 11:02:35 myhost clone_instance[16405] Cleaning up log files on SNAPSHOT LV
Aug 22 11:02:35 myhost clone_instance[16405] Removing replication information from SNAPSHOT LV
Aug 22 11:02:35 myhost clone_instance[16405] Found defaults file /root/.my-dev.cnf needed to access dev-instance
Aug 22 11:02:35 myhost clone_instance[16405] Starting SNAPSHOT DB instance dev-instance [mysqld2] using: 'mysqld_multi start 2'
Aug 22 11:02:35 myhost clone_instance[16405] Clone procedure complete.
[root@myhost ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
160G  135G   26G  84% /mysql/live-instance
160G  134G   27G  84% /mysql/dev-instance
[root@myhost ~]# lvs
LV             VG        Attr   LSize   Origin         Snap%  Move Log Copy%  Convert
live-instance  volgroup1 owi-ao 160.00G
dev-instance   volgroup1 swi-ao  40.00G live-instance    2.06
[root@myhost ~]#

The longest part of the procedure is shutting down the active live slave. The rest of the time is insignificant.
The “dev experience” when using this dev-instance is just as before. The instance works, can be modified and behaves just as you would expect.

Review of IPv6 Network Administration

August 18th, 2010

Originally submitted at O’Reilly

This essential guide explains what works, what doesn’t, and most of all, what’s practical about IPv6–the next-generation Internet standard. A must-have for network administrators everywhere looking to fix their network’s scalability and management problems. Also covers other IPv6 ben…

A little outdated – needs updating

By Simon Mudd from Madrid, Spain on 8/18/2010
3out of 5

Pros: Well-written, Easy to understand

Cons: Too basic

I forgot to check the date this book was published before buying it and that was a mistake. The content was pretty good but as it was written in 2005 a lot of what it says regarding versions of operating systems and the current status of software is well out of date. That’s unfortunate and could probably be easily corrected. I think that while the target audience is not clear, I am looking from a hobbyist’s perspective as to how to setup my current home network to IPv6 and I did not really get enough information. The fact that wifi is pretty common now, but that if the wifi router you use does not support IPv6 (which is typical) then you have to make tunnels in your own IPv4 lan to achieve connectivity between hosts.

I have many other O’Reilly titles and some of them are very good. I was hoping for more from this one especially as I had previously bought IPv6 Essentials some time back.