Archive for 18th June 2009

cinfo for seeing how much of a file is being cached

cinfo is a kick-ass program written by Bert Hubert, the author of PowerDNS, that shows how much of a file is being cached by the operating system. It’s great for testing whether MyISAM data files are being read from the operating system cache or off disk. For example, this is the cache status of the MYD files for my blog:

wordpress_comments.MYD: 1127 pages, 1121 pages cached (99.47%)
wordpress_links.MYD: 1 pages, 1 pages cached (100.00%)
wordpress_options.MYD: 309 pages, 309 pages cached (100.00%)
wordpress_postmeta.MYD: 2 pages, 2 pages cached (100.00%)
wordpress_posts.MYD: 100 pages, 94 pages cached (94.00%)
wordpress_term_relationships.MYD: 2 pages, 2 pages cached (100.00%)
wordpress_terms.MYD: 1 pages, 1 pages cached (100.00%)
wordpress_term_taxonomy.MYD: 1 pages, 1 pages cached (100.00%)
wordpress_usermeta.MYD: 10 pages, 10 pages cached (100.00%)
wordpress_users.MYD: 2 pages, 2 pages cached (100.00%)

Most of the files are in the filesystem cache, so there is a very good chance that when you loaded this page MySQL was able to read all the data for my blog from RAM.
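To get a single number for a whole set of files, the per-file lines can be added up. The following is a minimal sketch that assumes the output format shown above; `summarize` and `LINE_RE` are names made up for this example and are not part of cinfo.

```python
import re

# Matches per-file lines such as:
#   wordpress_posts.MYD: 100 pages, 94 pages cached (94.00%)
LINE_RE = re.compile(r"^(?P<name>\S+): (?P<total>\d+) pages, (?P<cached>\d+) pages cached")


def summarize(output):
    """Return (total_pages, cached_pages, percent_cached) for cinfo-style output."""
    total = cached = 0
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a per-file line
        total += int(match.group("total"))
        cached += int(match.group("cached"))
    percent = 100.0 * cached / total if total else 0.0
    return total, cached, percent


# Two lines copied from the listing above.
sample = """\
wordpress_comments.MYD: 1127 pages, 1121 pages cached (99.47%)
wordpress_posts.MYD: 100 pages, 94 pages cached (94.00%)
"""

total, cached, percent = summarize(sample)
print(f"{cached} of {total} pages cached ({percent:.2f}%)")
```

Summing pages rather than averaging the per-file percentages keeps a one-page file from counting as much as a thousand-page one.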

cinfo is distributed as a source tarball, but I’ve created an x86_64 rpm. You can download a modified tarball that can be used with rpmbuild to create your own rpm. The basic process for building an rpm from my tarball is:

Create a ~/.rpmmacros file with a path to a buildroot:

%_topdir /tmp/my_buildroot

Then create some directories in that buildroot:

mkdir -p /tmp/my_buildroot
cd /tmp/my_buildroot
mkdir SPECS SRPMS BUILD RPMS

The rpm can be built directly from the tarball with:

rpmbuild -ta cinfo-0.1.tar.gz

If everything works, it should drop an arch-specific rpm in:

/tmp/my_buildroot/RPMS

Load average

Modern operating systems give the impression of running more processes than there are processors by giving each process that wants to run, such as your web browser or text editor, a tiny slice of time in which to execute. By switching between processes very quickly, the system makes it look as though they are all running at once, even though only a few are running at any instant. Processes that want to run but don’t have a processor are placed in a queue, where they wait for their slice of CPU time. The number of processes waiting in this queue, averaged over a period of time, is called load average.

Most people who run a flavor of Unix inevitably come across load average. It shows up in quite a few tools, such as uptime, top, and w, and is usually displayed as a 1, 5, and 15 minute average. Load average changes with the number of processes wanting to run on a machine. It is a deceptively evil monitoring metric because it’s very difficult to define what a good or bad load average is.
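The same three numbers can be read from a script. This is a small sketch using Python's standard `os.getloadavg()`, which is available on Unix-like systems; the wrapper name `read_load_average` is made up for this example.

```python
import os


def read_load_average():
    """Return the 1, 5 and 15 minute load averages as a tuple of floats."""
    # os.getloadavg() raises OSError if the system cannot provide the values.
    return os.getloadavg()


one, five, fifteen = read_load_average()
print(f"load average: {one:.2f}, {five:.2f}, {fifteen:.2f}")
```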

Load average is often used as a comparison metric when trying to diagnose machines that are overloaded. While participating in discussions about machine load I often hear things like, “But the load average is OK” or, “The load average is way high.” The problem is that an OK load average on one server may be 10, while on a server that’s overloaded it may also be 10, or 5, or 200. For situations like this it’s much easier to use response time, because it can be compared easily between different types of servers or software. A response time of one second is a response time of one second no matter where you are, and it’s very easy to judge whether that’s good or bad for the application. Two identical servers can both have a response time of one second while one shows a load average of 200 and the other a load average of 50.
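Measuring response time doesn't take much. Here is a minimal sketch that times a single operation with `time.perf_counter()`; `response_time` and `fake_request` are hypothetical names, and the 50 ms sleep stands in for whatever request you actually care about (a page load, a query, and so on).

```python
import time


def response_time(operation):
    """Run operation() once and return how long it took, in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def fake_request():
    # Hypothetical stand-in for a real request; it just sleeps for 50 ms.
    time.sleep(0.05)


elapsed = response_time(fake_request)
print(f"response time: {elapsed:.3f}s")
```

The resulting number means the same thing on any server, which is what makes it usable for comparison.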

Using load average as a comparison metric is so difficult because a higher load average doesn’t really mean a server is more loaded or that it isn’t responding appropriately. The reason is that the time a process spends waiting in the queue can be affected by different forces, including the number of processes in line before it, how long those processes run for, and whether your process gets an interrupt that lets it jump the queue. This is why a server with a load average of 10 may be almost impossible to work with while a server on the same hardware right next to it, with a load average of 200, may be responding just fine.
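A toy calculation shows how this can happen. It assumes a deliberately simplified model: a single strict first-come, first-served queue in which every process ahead of yours runs once for the same amount of time, with no priorities and no interrupts. The function name and the numbers are made up for illustration.

```python
def wait_ms(queue_depth, run_ms_per_process):
    """Milliseconds a newly queued process waits before it gets the CPU.

    Simplified model: every process ahead of it runs once, in order,
    for run_ms_per_process milliseconds.
    """
    return queue_depth * run_ms_per_process


# 200 processes in the queue, each running for about 1 ms at a time.
deep_queue = wait_ms(200, 1)
# 10 processes in the queue, each holding the CPU for about 100 ms.
shallow_queue = wait_ms(10, 100)

print(f"queue depth 200: wait {deep_queue} ms")    # 200 ms
print(f"queue depth 10:  wait {shallow_queue} ms")  # 1000 ms
```

In this model the machine with the deeper queue gets to your process five times sooner, because how long each process runs matters as much as how many are waiting.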

The moral of all of this is to use load average for exactly what it is: the depth of the run queue. Don’t use it for comparison between different types of servers, and don’t quote it when saying a server is overloaded.