fadvise syscall, myisam data file caching, and a lesson learned in debugging

fadvise is a system call that can be used to give Linux hints about how it should be caching files. It has a few options for caching, not caching, read ahead, and random access. I was looking into used fadvise because a client ran into an issue where some infrequently used myisam data files were being pushed out of the filesystem cache by binary logs and other activity. The files are used infrequently for queries but when they are used they need to be fast. When the files weren’t cached the particular query ran in about 30 seconds. When they were cached the query ran in .8 seconds — huge difference. The fix seemed pretty trivial, call fadvise on the myd files, they stay in cache, the queries are consistently faster and the problem is solved. It seemed simple but it wasn’t. I’ll cover the myisam issue in more detail in another post, this is all about fadvise and debugging.

I couldn’t find an rpm that contained a simple tool to control fadvise that I could called from a shell script (aside from the perl with inline C tool) so I decided to roll my own. The client has an existing system for building C tools to be deployed on servers. The system is such that it takes a bit of time to compile, deploy, and test so I wrote a small C app to test my theory which I could later merge into the clients tool system. I wrote the tool, tested it and the files stayed in cache as predicted so I started on the process of merging the tool with the clients system. The system used to compile the tool is 32bit the mysql servers are 64bit. When I originally wrote the test tool I compiled it directly on the 64 bit boxes.

When I rolled the fadvise tool compiled on the 32 bit system to the 64 bit system it failed with an illegal seek error. My original call to fadvise looked something like this:

posix_fadvise(fd, 0, lseek(fd, 0, SEEK_END), POSIX_FADV_WILLNEED);

I’m sure some of you are thinking, “the lseek is pointless, 0 will include the entire file.” I learned that lesson later on and it plays an important part in this story. When I ran strace on the 32bit binary on the 64 bit system the fadvise syscall looked like:

fadvise64(3, 0, 0, 0×594 /* POSIX_FADV_??? */) = -1 EINVAL (Invalid argument)

Strange, right? I checked the code, added some debugging output for the lseek and tried again with the same result. I went back to my small test program and checked to see if it had the same fadvise and lseek calls. It did. I pinged a friend on IRC and he suggested quite a few things to check including removing the unnecessary lseek. After the lseek is removed the strace output changes to:

fadvise64(3, 0, 0, POSIX_FADV_NORMAL) = 0

Notice that the last argument is POSIX_FADV_NORMAL instead of POSIX_FADV_WILLNEED. At this point I had been trying to debug this for a few hours and I started to think I was going insane. Frustration lead to typos and eyestrain. I sent the output back to my friend assuring him that I’m not insane and this has to be another problem. After some more debugging something clicks in his brain and he realizes it’s a kernel bug. (Note: I don’t fully understand this part so if you’re reading this and you do, please leave a comment.). The 32bit binary running on the 64bit kernel with the bug is taking the 3rd argument and splitting it across two registers to make up the third and fourth argument. POSIX_FADV_NORMAL is defined as 0 so the call is really

fadvise64(3,0,0,0)

Which in my case is a no op but it’s a successful no op and since I haven’t checked it with strace I think it’s caching files. I move this fix over to the clients tool, recompile, and retry my myisam performance test only to find that the files aren’t being kept in memory. Rerunning the fadvise tool doesn’t bring them in memory because it’s being called with POSIX_FADV_NORMAL. After more work with my super smart friend we pin down the problem to the register issue above and I resolve to write a test the next day (today) to test the the theory.

The test works.

Now I have two binaries created from the same source code, one that will successfully execute on both 32 and 64 bit architectures but is a no op on the 64 bit architecture. Since the clients build system isn’t setup to compile 64 bit binaries I end up giving them two binaries with a shell script that will detect the architecture and execute the correct binary.

The debugging lesson learned is to stay calm and think through problems. Frustration and disbelief led to the debugging cycle taking a few hours longer than it should have (it was about 10hrs total including all the testing). It also stressed me out. The other lesson learned is that is pays to have really smart friends. Thanks guys, I owe you a beer.

One Comment

  1. Eric Wheeler says:

    Interesting article, I wonder: do you have a simple binary that opens the .MYD, calls fadvise on the read-only fd, and daemonizes, and would this be sufficient to get the kernel to hold the file content cached? Can you or have you released your cache-pinning program?

    I’m looking to use fadvise for an opposite example: I need to set FADV_NOREUSE for block-device backups since it is pointless to cache a sequential block-device read when it will only be read once.

    In this case, I am using the rdiff to replicate LVM snapshots, and was thinking of writing an LD_PRELOAD wrapper for open() and fadvising the resulting file descriptor instead of hacking the rdiff code.

    Since you have dug into the depths of fadvise, do you you have any advice here?

    -Eric

Leave a Reply