cinfo for seeing how much of a file is being cached
cinfo is a kick ass program written by Bert Hubert, the author of PowerDNS, to show how much of a file is being cached by the operating system. It’s great for testing to see if MyISAM data files are being read from the operating system cache or off disk. For example, this is the cache status of the .MYD files for my blog:
wordpress_comments.MYD: 1127 pages, 1121 pages cached (99.47%)
wordpress_links.MYD: 1 pages, 1 pages cached (100.00%)
wordpress_options.MYD: 309 pages, 309 pages cached (100.00%)
wordpress_postmeta.MYD: 2 pages, 2 pages cached (100.00%)
wordpress_posts.MYD: 100 pages, 94 pages cached (94.00%)
wordpress_term_relationships.MYD: 2 pages, 2 pages cached (100.00%)
wordpress_terms.MYD: 1 pages, 1 pages cached (100.00%)
wordpress_term_taxonomy.MYD: 1 pages, 1 pages cached (100.00%)
wordpress_usermeta.MYD: 10 pages, 10 pages cached (100.00%)
wordpress_users.MYD: 2 pages, 2 pages cached (100.00%)
Most of the files are in the filesystem cache, so there is a very good chance that when you loaded this page MySQL was able to read all the data for my blog from RAM.
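To put a single number on output like the listing above, the per-file lines can be added up. This is only a sketch: it assumes every line follows the "name: N pages, M pages cached" format shown in this post, and the function name is mine, not part of cinfo.

```python
import re

# Matches per-file lines in the format shown above, e.g.
# "wordpress_posts.MYD: 100 pages, 94 pages cached (94.00%)"
LINE_RE = re.compile(r"^(?P<name>\S+): (?P<total>\d+) pages, (?P<cached>\d+) pages cached")


def summarize_cinfo(output):
    """Return (total_pages, cached_pages, percent_cached) across all files."""
    total = 0
    cached = 0
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a per-file line
        total += int(match.group("total"))
        cached += int(match.group("cached"))
    percent = 100.0 * cached / total if total else 0.0
    return total, cached, percent
```

Feeding it the captured output as one string gives the overall cached percentage, which is handy when there are too many files to eyeball.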
Cinfo is distributed by source tarball but I’ve created an x86_64 rpm. You can download a modified tarball that can be used with rpmbuild to create your own rpm. The basic process for building an rpm from my tarball is:
Create a ~/.rpmmacros file with a path to a buildroot:
%_topdir /tmp/my_buildroot
Then create some directories in that buildroot:
cd /tmp/my_buildroot
mkdir SPECS SRPMS BUILD RPMS
The rpm can be built directly from the tarball with:
rpmbuild -ta cinfo-0.1.tar.gz
If everything works OK it should drop an arch-specific rpm in:
/tmp/my_buildroot/RPMS
Load average
Modern operating systems give the impression of running more processes than the number of available processors by giving each process that wants to run, such as your web browser or text editor, a tiny slice of time in which it can execute. By switching out the running process very quickly it looks as though all processes are running even though only a few are at a time. Other processes that want to run are placed in a queue where they wait for their slice of cpu time to execute. The number of processes waiting in this queue, averaged over time, is called load average.
Most people who run a flavor of unix inevitably come across load average. It’s in quite a few tools such as uptime, top, and w. It’s usually displayed as a 1, 5, and 15 minute average. Load average can change based on the number of processes wanting to run on a machine. This is a deceptively evil monitoring metric because it’s very difficult to define what a good and bad load average is.
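On Linux the same three averages are also exposed in /proc/loadavg, which is handy when you want them in a script. Here is a small sketch of pulling the fields apart. It assumes the Linux format (three averages, then running/total tasks, then the last pid); the function name and dictionary keys are mine.

```python
def parse_loadavg(text):
    """Parse one line in /proc/loadavg format into a dict."""
    fields = text.split()
    # The first three fields are the 1, 5, and 15 minute averages.
    one, five, fifteen = (float(value) for value in fields[:3])
    # The fourth field is "currently runnable tasks / total tasks".
    runnable, total = (int(value) for value in fields[3].split("/"))
    return {
        "1min": one,
        "5min": five,
        "15min": fifteen,
        "runnable_now": runnable,
        "total_tasks": total,
    }
```

Reading /proc/loadavg and passing its contents to parse_loadavg gives the same numbers uptime prints. Python's os.getloadavg() returns just the three averages if that's all you need.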
Load average is often used as a comparison metric when trying to diagnose machines that are overloaded. While participating in discussions about machine load I often hear things like, “But the load average is OK” or, “The load average is way high.” The problem with this is that on one server an OK load average may be 10 and on a server that’s overloaded it may also be 10, or 5, or 200. For situations like this it’s much easier to use response time because it can be easily compared between different types of servers or software. A response time of one second is a response time of one second no matter where you are. It’s very easy to judge if that’s good or bad based on the application. Of two identical servers, one could have a response time of one second with a load average of 200 and the other could have a response time of one second with a load average of 50.
Using load average as a comparison metric is so difficult because a higher load average doesn’t really mean a server is more loaded or that it isn’t responding appropriately. The reason for this is that the time a process spends waiting in the queue can be affected by different forces including the number of processes in line before it, how long those processes run for, and if your process gets an interrupt that can let it jump the queue. This is why a server with a load average of 10 may be almost impossible to work with and a server on the same hardware right next to it with a load average of 200 may be responding just fine.
The moral of all of this is to use load average for exactly what it is, the depth of the run queue. Don’t use it for comparison between different types of servers and don’t quote it when saying a server is overloaded.
fadvise syscall, myisam data file caching, and a lesson learned in debugging
fadvise is a system call that can be used to give Linux hints about how it should be caching files. It has a few options for caching, not caching, read ahead, and random access. I was looking into using fadvise because a client ran into an issue where some infrequently used MyISAM data files were being pushed out of the filesystem cache by binary logs and other activity. The files are used infrequently for queries but when they are used they need to be fast. When the files weren’t cached the particular query ran in about 30 seconds. When they were cached the query ran in .8 seconds, a huge difference. The fix seemed pretty trivial: call fadvise on the .MYD files, they stay in cache, the queries are consistently faster and the problem is solved. It seemed simple but it wasn’t. I’ll cover the MyISAM issue in more detail in another post; this one is all about fadvise and debugging.
I couldn’t find an rpm that contained a simple tool to control fadvise that I could call from a shell script (aside from the perl with inline C tool) so I decided to roll my own. The client has an existing system for building C tools to be deployed on servers. The system is such that it takes a bit of time to compile, deploy, and test, so I wrote a small C app to test my theory which I could later merge into the client’s tool system. I wrote the tool, tested it and the files stayed in cache as predicted, so I started on the process of merging the tool with the client’s system. The system used to compile the tool is 32 bit; the MySQL servers are 64 bit. When I originally wrote the test tool I compiled it directly on the 64 bit boxes.
When I rolled the fadvise tool compiled on the 32 bit system to the 64 bit system it failed with an illegal seek error. My original call to fadvise looked something like this:
posix_fadvise(fd, 0, lseek(fd, 0, SEEK_END), POSIX_FADV_WILLNEED);
I’m sure some of you are thinking, “the lseek is pointless, 0 will include the entire file.” I learned that lesson later on and it plays an important part in this story. When I ran strace on the 32bit binary on the 64 bit system the fadvise syscall looked like:
fadvise64(3, 0, 0, 0x594 /* POSIX_FADV_??? */) = -1 EINVAL (Invalid argument)
Strange, right? I checked the code, added some debugging output for the lseek and tried again with the same result. I went back to my small test program and checked to see if it had the same fadvise and lseek calls. It did. I pinged a friend on IRC and he suggested quite a few things to check including removing the unnecessary lseek. After the lseek is removed the strace output changes to:
fadvise64(3, 0, 0, POSIX_FADV_NORMAL) = 0
Notice that the last argument is POSIX_FADV_NORMAL instead of POSIX_FADV_WILLNEED. At this point I had been trying to debug this for a few hours and I started to think I was going insane. Frustration led to typos and eyestrain. I sent the output back to my friend assuring him that I’m not insane and this has to be another problem. After some more debugging something clicks in his brain and he realizes it’s a kernel bug. (Note: I don’t fully understand this part so if you’re reading this and you do, please leave a comment.) The 32 bit binary running on the 64 bit kernel with the bug is taking the 3rd argument and splitting it across two registers to make up the third and fourth argument. POSIX_FADV_NORMAL is defined as 0 so the call is really:
fadvise64(3,0,0,0)
Which in my case is a no op but it’s a successful no op and since I haven’t checked it with strace I think it’s caching files. I move this fix over to the client’s tool, recompile, and retry my MyISAM performance test only to find that the files aren’t being kept in memory. Rerunning the fadvise tool doesn’t bring them in memory because it’s being called with POSIX_FADV_NORMAL. After more work with my super smart friend we pin down the problem to the register issue above and I resolve to write a test the next day (today) to test the theory.
The test works.
Now I have two binaries created from the same source code: a 32 bit one that will successfully execute on both 32 and 64 bit architectures but is a no op on the 64 bit architecture, and a 64 bit one that works as intended. Since the client’s build system isn’t set up to compile 64 bit binaries I end up giving them two binaries with a shell script that will detect the architecture and execute the correct binary.
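The wrapper logic is simple enough to sketch. This is a Python illustration of the idea, not the shell script I handed over, and the binary names are made up for the example.

```python
import platform

# Hypothetical file names; the real binaries are not named in this post.
BINARIES = {
    "x86_64": "fadvise-tool.x86_64",
    "i686": "fadvise-tool.i686",
}


def pick_binary(machine=None):
    """Return the binary name matching the kernel architecture."""
    # platform.machine() reports the same value as `uname -m`.
    machine = machine or platform.machine()
    if machine in ("i386", "i486", "i586", "i686"):
        machine = "i686"  # treat all 32 bit x86 variants alike
    if machine not in BINARIES:
        raise RuntimeError("no binary built for %s" % machine)
    return BINARIES[machine]
```

The important detail is that the choice keys off the kernel architecture, since the no op only shows up when the 32 bit binary runs on the 64 bit kernel.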
The debugging lesson learned is to stay calm and think through problems. Frustration and disbelief led to the debugging cycle taking a few hours longer than it should have (it was about 10hrs total including all the testing). It also stressed me out. The other lesson learned is that it pays to have really smart friends. Thanks guys, I owe you a beer.
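For reference, here is roughly what the intended call looks like without the pointless lseek. This is a sketch in Python rather than the C tool from the story, and the function name is mine. It relies on a length of 0 meaning "through the end of the file" and on os.posix_fadvise raising OSError if the kernel rejects the call.

```python
import os


def advise_willneed(path):
    """Ask the kernel to pull the whole file at `path` into the page cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 with len=0 covers the file from the start to its end,
        # so there is no need to seek to find the file size first.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

A call that returns without error only means the kernel accepted the hint. As the story above shows, it is still worth confirming with strace that the advice argument is the one you meant, and with a tool like cinfo that the pages actually ended up cached.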
Using vim as a man page viewer
I love color on my terminal. Black and white editors and man pages are boring and more difficult to read than simple syntax highlighting. Vim supports syntax highlighting for man pages. Using it is as simple as setting an environment variable.
MANPAGER="col -b | vim -c 'set ft=man nomod nolist' -"
You can add this line into your .bashrc to enable vim to be the man page viewer by default.
export MANPAGER="col -b | vim -c 'set ft=man nomod nolist' -"
The one problem with this is that it gets somewhat annoying to type :q to quit viewing a man page. The default man page viewer, less, has ‘q’ mapped to quit. I map this in vim to make it easier to quit viewing files and man pages. Simply add this to your .vimrc file.
map q :q!<CR>
This will map the q key to quit vim while in normal or visual mode. I’m not sure what q does by default but whatever it does I don’t use it.
Why Oracle can’t kill MySQL.
When Oracle agreed to acquire Sun there was some speculation that Oracle might try to kill MySQL. First, this wouldn’t be a very prudent effort on Oracle’s part, and second, it’s not even possible. I think Monty has the best explanation from his comment on his blog:
The simple fact is one can’t own an open source project. One can control it by controlling the people which are leading and developing it. When a company doesn’t take good care of their employees and those employees start to leave the company and work on the project elsewhere, that company has lost control of the project.
The whole comment is worth reading. Monty does a good job of putting the Sun purchase into perspective with regards to MySQL and the developer community.
Update of Google’s Sysbench patch to 0.4.12
[Update: I found the magic javascript links that show old releases of sysbench.]
Sysbench is an application that can be used to benchmark different system parameters and also includes support for testing MySQL directly. Google has released a patch for sysbench that adds a lot of new OLTP tests. It’s great for testing MySQL and for drag races against Mark’s tests. Their patch seems to apply against sysbench 0.4.10. I was able to find sysbench 0.4.10 but it wasn’t easy so I’ve ported Google’s patch to sysbench 0.4.12.
Grab the patch here.
Percona Performance Conference EMT Presentation Slides
I sat down about 20 minutes ago to write a blog post that included a link to the slides of my EMT presentation. It turned into a long post about the presentation, how I feel EMT was received and my feelings on presentations in general. Here is the short post and the link to the slides.
The MySQL conference always inspires me to write so expect a longer post in a few days.
MySQL Brings the Heat
This week throngs of MySQL developers, users, and enthusiasts descended on Silicon Valley. Apparently the valley’s cooling system can’t keep up because as they arrived the outside temperature went up into the 90s (32s for those of you who choose to use a sane temperature measurement system). I’m not attending the conference this year but I almost wish I was to get some of the air conditioning. It’s supposed to cool off and rain on Saturday. As the brain power leaves I suspect the valley is going to cool and moisture in the air will condense into rain, or tears.
I bet you thought this post was going to be about Oracle and Sun, sorry. I think the weather is more interesting.
Longest beta ever, myisamchk --parallel-recover
I was reading through the manual and noticed that the myisamchk --parallel-recover option is still listed as beta code. The feature was added in 4.0.2, which was released in July 2002. This means it’s been in beta longer than Gmail.
Where did 5.0.79 enterprise come from?
While updating the mirror last week I was surprised to see that the newest MRU MySQL release is numbered 5.0.79. Previously enterprise releases had even numbers and community releases had odd numbers. I posted the question in #mysql-dev and HarrisonF was kind enough to explain it all.
MySQL 5.0 is running out of version numbers. There are limitations in mysql_get_server_version(), the executable comment syntax, and other places that mean MySQL can only have two-digit release version numbers. MySQL Enterprise has started using odd and even version numbers to extend the life of 5.0.
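The two-digit limit is easier to see with the arithmetic written out. The C API's mysql_get_server_version() reports the version as major*10000 + minor*100 + release, and executable comments use the same five-digit number. The sketch below mirrors that encoding; the helper names are mine, not part of MySQL.

```python
def version_id(major, minor, release):
    """Pack a version into the major*10000 + minor*100 + release form."""
    return major * 10000 + minor * 100 + release


def executable_comment(sql, major, minor, release):
    """Wrap `sql` so only servers at or above the given version run it."""
    return "/*!%05d %s */" % (version_id(major, minor, release), sql)
```

With this encoding 5.0.79 becomes 50079, but a hypothetical 5.0.100 would become 50100, which is the same number as 5.1.0. That collision is why the release part can't grow past 99.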
This raises a few questions. What will happen to 5.0 when it runs out of release numbers? Is community going to be sacrificed to give enterprise more versions to use? Are the version restrictions going to be fixed in the future? For example if a feature is implemented in a community release the executable comment version syntax isn’t suitable for preventing it from being executed in a newer enterprise release because the version scheme doesn’t differentiate between enterprise and community.
I think this is now rock solid proof that there were too many features packed into 5.0 and it was released too early. I hope there will be more major releases in the future with fewer features so these problems are prevented. By the way the advanced vs pro enterprise binaries add a whole new layer to the MySQL version issues.
