I thought I would upload a small script I use which I find quite useful. log_processlist.sh is normally run minutely from cron and collects statistics from any MySQL instance it finds. Data is kept for up to a week and then rotated. This allows you to check what happened over a period of time, for example if an instance crashes or seems to have a lot of load.
I have found it quite invaluable on the instances I’m monitoring and once set up it’s usually nice and silent. If you find this useful or have suggestions for improvements please let me know.
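For anyone who just wants the gist of it, the core of the idea is roughly the following. This is a simplified sketch, not the real script; the output directory, the reliance on ~/.my.cnf for credentials and the file naming are only examples:

#!/bin/bash
# Sketch only: snapshot the processlist and a few counters once, writing to a
# file named by day-of-week and time so that a week later the same name is
# reused and the old data is overwritten (the "rotation").
outdir=/var/log/mysql-snapshots
mkdir -p "$outdir"
stamp=$(date +%a-%H%M)        # e.g. Mon-1305
{
    date
    mysql -e 'SHOW FULL PROCESSLIST'
    mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_%'"
} > "$outdir/$stamp.txt" 2>&1

and run it minutely from cron with a line such as:

* * * * * /usr/local/bin/log_processlist.sh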
Although I’m a fan of writing scripts to solve administrative issues… running this every minute is going to needlessly consume resources. You’re better off using a MySQL monitoring/performance tuning app to track historical use of the server. I’m referring to MySQL Enterprise Monitor, MonYog, Kontrollbase, ZABBIX, or one of the many other solutions out there.
@FooBar:
All of those tools use more resources than invoking a small process to grab ‘show processlist’. With the vast resources of modern computers, something running once a minute is not going to be a big drag. And I’m not suggesting that you shouldn’t be running a graphing tool for trending (I noticed that Cacti is missing from your list); that is a great idea.
@Simon:
There are lots of ways to get information about your queries, but one of my least favorites is the processlist.
My personal favorite is the slow query log, which doesn’t consume a lot of resources and can be configured to collect all queries.
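On 5.1 and later, for example, something like this enables it at runtime and drops the threshold so that every query is logged (keep an eye on disk space):

mysql -e "SET GLOBAL slow_query_log = 1"
mysql -e "SET GLOBAL long_query_time = 0"   # 0 = log every query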
If you do want to grab ‘processlist’ information, you might want to consider using mk-query-digest and the --query-review and --review-history options to regularly report on the information and know at a glance when new queries are added, etc. It supports reading a processlist from a running server.
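Roughly like this; the option names and the review tables are from memory, so check the mk-query-digest documentation for your version:

mk-query-digest --processlist h=localhost \
    --review h=localhost,D=maatkit,t=query_review \
    --review-history h=localhost,D=maatkit,t=query_review_history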
If you want to find information about periodic spikes, then I suggest you look at stalk/collect. If you can identify a symptom of your spike (such as the number of connections exceeding a threshold for N intervals, etc) you can use stalk to identify when that symptom is happening and invoke collect, which will grab the processlist, iostat, vmstat, show innodb status, etc, all specifically targeted to the event.
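A stripped-down version of that idea in shell looks something like this; the threshold, interval and output directory are made up for the example:

#!/bin/bash
# Trip when Threads_connected stays above a threshold for N consecutive checks,
# then grab a burst of diagnostics around the event.
outdir=/var/log/mysql-collect; mkdir -p "$outdir"
threshold=500 needed=3 hits=0
while sleep 10; do
    connected=$(mysql -BNe "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | awk '{print $2}')
    if [ "$connected" -gt "$threshold" ]; then hits=$((hits + 1)); else hits=0; fi
    if [ "$hits" -ge "$needed" ]; then
        stamp=$(date +%Y%m%d-%H%M%S)
        mysql -e 'SHOW FULL PROCESSLIST'       > "$outdir/$stamp.processlist"
        mysql -e 'SHOW ENGINE INNODB STATUS\G' > "$outdir/$stamp.innodbstatus"
        vmstat 1 10    > "$outdir/$stamp.vmstat" 2>&1 &
        iostat -x 1 10 > "$outdir/$stamp.iostat" 2>&1 &
        hits=0
    fi
done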
Sorry, that suffered from some bad editing.
All monitoring uses resources. Suggesting that a simple script that runs once a minute would have more impact than, say, Enterprise Monitor is probably not right.
I think it is a good idea to use a trending system, but it will use resources to do its job too.
There is some humour in a comment that claims this tiny shell script, running lightweight diagnostic commands once a minute, wastes resources, and then suggests using a bloated tool instead which would consume even more resources.
You are right. This script is not meant to be used as an alternative to nagios, cacti, Merlin (MySQL Enterprise Monitor) or any other equivalent software which should also be running on a well managed system.
This script fills a hole left by that software, which does not give you snapshots of state over time that you can go back and check later. I’ve used this when there’s been a sudden change in the number of connections going to the database to figure out where those connections came from, which users generated the problem and what they were doing. None of the software you mention can do that. They tend to provide counters and graphs and can warn you if certain counters are out of the expected range.
This script gives you some ad hoc information which allows you to get an idea of what happened AFTER the event, when you are diagnosing in more detail, or of how things change over a day or a few days. Often it requires extra scripts to manipulate this raw information, but it’s the raw information you want to store.
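For example, given one stored snapshot of tab-separated SHOW FULL PROCESSLIST output (snapshot.txt here is just a placeholder name), a quick awk gives the connection count per user and host:

awk -F'\t' 'NF >= 8 && $1 != "Id" { split($3, h, ":"); n[$2 "@" h[1]]++ }
            END { for (k in n) print n[k], k }' snapshot.txt | sort -rn | head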
As for resources, I have not noticed this being a problem. The MySQL servers I manage are busy, and this script has been running on them for several years and has proved itself. It requires next to no maintenance. The disk storage it generates acts like a circular buffer and so does not require cleaning up.
If it is not suitable for your environment then of course do not use it.
Justin, I was unaware of stalk/collect. Sounds interesting. I’ll have to look at this.
I’m not so sure about using the slow query log to collect all queries, especially on a busy master server processing thousands of queries a second. You are right that it can provide some invaluable insight into what happens, and perhaps in many cases, when configured with a low enough threshold, it gives you enough information to see bad queries or queries you don’t expect to be there.
Other tools like Merlin’s Query Analyzer are also very useful, but I have seen issues with that on a busy server. Providing the proxy with a smaller sample of the total traffic tends to help a lot in that respect.
In response to you all: the script _is_ lightweight and _has_ proved its worth in the past. In the end we all need several tools to monitor our database servers, and we pick the one that gives the most information when diagnosing issues. More instrumentation inside the mysql server itself would be nice, and 5.5 and performance_schema seem to be the way the developers are going to make that available to us. That is very welcome.
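As a small taste of what 5.5 already exposes, the global wait summaries can be queried directly (performance_schema has to be enabled at server startup):

mysql -e "SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT
          FROM performance_schema.events_waits_summary_global_by_event_name
          ORDER BY SUM_TIMER_WAIT DESC LIMIT 10"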
Scripts like this are vital as you say — for diagnosis you need not only the stream of events, but the snapshots of state at points in time. The slow query log doesn’t provide a good way to, for example, see how many queries were running at a given instant and what state they were in.
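With a stored processlist snapshot that question becomes a one-liner; assuming the tab-separated SHOW FULL PROCESSLIST format, where the seventh column is the thread state, and snapshot.txt as a placeholder file name:

awk -F'\t' 'NF >= 8 && $1 != "Id" { s[($7 == "") ? "(none)" : $7]++ }
            END { for (k in s) print s[k], k }' snapshot.txt | sort -rn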
On the other hand, I think the slow query log often falls victim to a fallacy of “too much load.” If the server isn’t overloaded, it isn’t a problem. If the server is overloaded, you need it more than anything. The only real issue with it is disk consumption, but that’s easy to solve.
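Rotating it is only a couple of commands; the path is an example, and FLUSH SLOW LOGS needs 5.5, so older servers want FLUSH LOGS instead:

stamp=$(date +%Y%m%d)
mv /var/lib/mysql/slow.log /var/lib/mysql/slow.log.$stamp
mysql -e 'FLUSH SLOW LOGS'    # reopens the file so mysqld starts a fresh slow.log
gzip /var/lib/mysql/slow.log.$stamp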
I like the performance schema, but it is also one-dimensional, and proper diagnosis needs lots of views on problems. So it doesn’t go far enough, and I don’t think we should expect it to. It’s only one approach; I doubt it’s possible to create a single solution that works for everything.
You are doing something similar to what we do in our shop: we use events to monitor the number of running processes and snapshot when the count gets too high or a thread has been running too long. It would be easier to do this externally if we also wanted to grab INNODB STATUS output, but that is beyond what the event captures for now. Nice to hear how others are doing similar things.
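Compressed a lot, it looks something like the following; the thresholds and the dba.processlist_history table are made up for illustration (it assumes a dba schema exists and that the event scheduler can be switched on):

mysql <<'SQL'
SET GLOBAL event_scheduler = ON;
CREATE TABLE IF NOT EXISTS dba.processlist_history AS
    SELECT NOW() AS captured_at, p.* FROM information_schema.PROCESSLIST p WHERE 1 = 0;
DELIMITER //
CREATE EVENT dba.snapshot_when_busy
ON SCHEDULE EVERY 1 MINUTE
DO
BEGIN
    -- snapshot when too many threads are running or one has run too long
    IF (SELECT COUNT(*) FROM information_schema.PROCESSLIST WHERE COMMAND <> 'Sleep') > 100
       OR EXISTS (SELECT 1 FROM information_schema.PROCESSLIST
                  WHERE COMMAND <> 'Sleep' AND TIME > 300) THEN
        INSERT INTO dba.processlist_history
            SELECT NOW(), p.* FROM information_schema.PROCESSLIST p;
    END IF;
END//
DELIMITER ;
SQL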
KB
We are currently working to make pt-stalk in Percona Toolkit capable of automatic fault detection, which will allow it to learn what is “normal” behavior over time and capture information when the server’s behavior is not normal.
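In its current form it triggers on a fixed threshold, roughly along these lines (see the pt-stalk documentation for the exact options in your release):

pt-stalk --variable Threads_running --threshold 25 --cycles 5 \
         --dest /var/lib/pt-stalk --daemonize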
myname=$(basename $0)
lockfile=/tmp/$myname.lock
and later
echo $$ > $lockfile
doesn’t look like a good idea – it makes the script vulnerable to symlink attacks and allows an attacker (who could be any user with shell access) to overwrite or destroy any file. Using a directory that is not world-writable would be a good idea 😉
Besides that, I’d recommend some quoting around $myname and $lockfile – if someone is crazy enough to rename the script to something with a space in it, it will break. (OK, you could also call that “shoot yourself in the foot” 😉)
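Something along these lines avoids both problems; the directory is only an example, the point being that it must not be world-writable, and mkdir fails atomically if the lock already exists:

myname=$(basename "$0")
lockdir="/var/run/$myname.lock"      # not world-writable, unlike /tmp
if ! mkdir "$lockdir" 2>/dev/null; then
    echo "$myname: already running, lock held at $lockdir" >&2
    exit 1
fi
trap 'rmdir "$lockdir"' EXIT
echo $$ > "$lockdir/pid"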