MySQL 5.0 can deadlock when flush logs, show processlist, and a slave connection happen at the same time

[ Note: I haven't fully investigated this problem but the post has been hanging around in my queue for months. Rather than have it rot there I am publishing what I know in hopes that it helps someone else. ]

There are many different kinds of locks in MySQL. Some of them are exposed to users, such as intention locks, table locks, and row locks. Others are not exposed at all: these are the mutexes that MySQL uses internally to protect resources from being modified by more than one thread at a time. They are numerous and complicated, and when they deadlock, MySQL can stop dead in its tracks. The latest deadlock I found happens when flush logs, show processlist, and a slave reconnect happen at the same time. I don’t have a core from the mysqld process, only stack traces. The breakdowns below list the locks I’m fairly sure each thread owns and the ones it may be stuck trying to acquire. I am working on gathering more data and building a repeatable test case, and I will update this post or link to a new one when I do.

Flush logs is responsible for rotating and deleting old logs. It rotates every log under the sun, including the relay and binary logs. A few years ago I wrote a patch, now included in MySQL 5.5, that lets users specify which log to rotate. When flushing binary logs, flush logs has a safeguard built in to make sure it isn’t going to delete a log that a binlog dump thread hasn’t sent to a slave yet. For this safeguard to work it has to loop through every client thread in MySQL and check whether it is a dump thread. If it is, it then checks which log that thread is on to make sure the log isn’t still in use.

Each log file has a few mutexes associated with it. The flush logs thread owns LOCK_log and LOCK_index from the binary log. In the notes below, bl refers to the binary log object, and linfo is the log info object, which is part of the thread for some reason.

Owns
bl->LOCK_log
bl->LOCK_index
Tries
LOCK_thread_count
linfo->lock

Thread 11 (Thread 0x458e6940 (LWP 27068)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003775608e50 in _L_lock_1233 () from /lib64/libpthread.so.0
#2 0x0000003775608dd3 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000066f8f6 in log_in_use(char const*) () <- LOCK_thread_count
#4 0x00000000005f88d0 in MYSQL_LOG::purge_logs_before_date(long) () <- gets bl->LOCK_index
#5 0x00000000005fa09c in MYSQL_LOG::rotate_and_purge(unsigned int) () <- gets bl->LOCK_log
#6 0x000000000058715d in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, bool*) ()
#7 0x000000000058c3d3 in mysql_execute_command(THD*) ()
#8 0x00000000005915f1 in mysql_parse(THD*, char const*, unsigned int, char const**) ()
#9 0x000000000059279e in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#10 0x000000000059374b in handle_one_connection ()
#11 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#12 0x0000003774ad3d1d in clone () from /lib64/libc.so.6


What I mean by a slave connection is what happens on the master when a slave connects. The master spawns a thread that is responsible for killing any other threads with the same server id and then sending binlogs to the slave. This is why two slaves configured with the same server id will disconnect each other. By default a slave tries to reconnect 60 seconds later, so each slave gets to download binlogs for about a minute before it is kicked off again. The important part for this deadlock is that when a new slave connects, the thread spawned by the master walks through every thread connected to MySQL looking for an old thread with the same slave server id and tries to kill it.

This is the stack from the new slave thread trying to kill the old one:

Owns
tmp->LOCK_delete from thread 16.
16->mysys_var->mutex
Tries
bl->LOCK_log from thread 16->mysys_var->current_mutex

Thread 17 (Thread 0x45968940 (LWP 27194)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003775608e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003775608cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000056863f in THD::awake(THD::killed_state) ()
#4 0x000000000066f8a3 in kill_zombie_dump_threads(unsigned int) ()
#5 0x0000000000592a6e in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#6 0x000000000059374b in handle_one_connection ()
#7 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003774ad3d1d in clone () from /lib64/libc.so.6

The current_mutex deserves a bit of explanation here. In MySQL, when a thread wants to wait on a condition, it copies the mutex and condition into mysys_var->current_mutex and mysys_var->current_cond; mysys_var->mutex is the mutex that protects the mysys_var struct itself. This way, if any other thread needs to kill it, that thread knows which condition and mutex the victim is waiting on and can wake it up. It is an easy mechanism for killing threads, but it makes it harder to figure out what a thread is actually blocking on. In this case I think current_mutex is bl->LOCK_log from thread 16, which is coming up.

Thread 16 (Thread 0x45aee940 (LWP 3819)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000037756101b1 in _L_cond_lock_989 () from /lib64/libpthread.so.0
#2 0x000000377561007f in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3 0x000000377560af84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4 0x00000000005f6637 in MYSQL_LOG::wait_for_update(THD*, bool) ()
#5 0x000000000067133a in mysql_binlog_send(THD*, char*, unsigned long long, unsigned short) ()
#6 0x0000000000592639 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#7 0x000000000059374b in handle_one_connection ()
#8 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#9 0x0000003774ad3d1d in clone () from /lib64/libc.so.6

Thread 16 is the old master thread waiting to dump a binlog. It should be killed by the new thread but isn’t.

Show processlist also loops through the list of client threads to get the id, hostname, user, state, and query that each one may be running. The common theme among all of these threads is that they all need to loop through the list of client threads in order to do their jobs. While looping through this list they must hold the LOCK_thread_count mutex. Owning this mutex means that no other thread can change the list of client threads, making it safe to iterate.

Owns
LOCK_thread_count
Tries
mysys_var->mutex probably from thread 16/17

Thread 3 (Thread 0x45a6c940 (LWP 13482)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003775608e50 in _L_lock_1233 () from /lib64/libpthread.so.0
#2 0x0000003775608dd3 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000065879a in mysqld_list_processes(THD*, char const*, bool) ()
#4 0x000000000058ecf6 in mysql_execute_command(THD*) ()
#5 0x00000000005915f1 in mysql_parse(THD*, char const*, unsigned int, char const**) ()
#6 0x000000000059279e in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#7 0x000000000059374b in handle_one_connection ()
#8 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#9 0x0000003774ad3d1d in clone () from /lib64/libc.so.6

I haven’t quite solved this one yet, but I hope the breakdown helps someone else with the same issue. I worked around it by disabling a script that runs show processlist and records the output, which makes it far less likely that all three conditions will line up to produce the deadlock. Other options are changing the log rotation script to flush only the logs it needs, or figuring out why the slave reconnect happens in the first place. I have a few theories about long-running log rotations causing this, but I haven’t been able to reproduce them.

4 Comments

  1. JoKoT3 says:

    I opened a bug (#13400021) with Oracle for something similar in MySQL Cluster. It has been corrected in version 7.1.18 (see http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-news-5-1-56-ndb-7-1-18.html).

  2. sbester says:

    http://bugs.mysql.com/bug.php?id=56299
    (mutex deadlock with COM_BINLOG_DUMP, binlog purge, and processlist/kill)

  3. James Day says:

    Eric, there’s been at least one fixed bug with a hang caused by log rotation. … was going to just be on vacation and mention my memory and say no more than that but I decided to do a quick check on bugs.mysql.com and came up with this one: http://bugs.mysql.com/bug.php?id=56299 . There may well be more. I looked for MYSQL_LOG::rotate_and_purge in my bugs search, you might also try some of the close calls on the stack to see what they turn up.

    If you don’t find a definite match please do create a bug report and mention any close bug reports that you rejected as possible matches if you’re feeling kind, it’ll save some time for the person who has to check and reproduce it. But report anyway, that’s optional. If it’s in 5.0 and 5.5 best to mention that you can reproduce with both. A fix in 5.5 is likely to be quite fast, one for 5.0 may never arrive because of its age.

    Thanks for the clear description, a nice job.

    Views are my own, for an official Oracle opinion consult a PR person.

    James Day, MySQL Senior Principal Support Engineer, Oracle

  4. [...] MySQL 5.0 can deadlock when flush logs, show processlist, and a slave connection happen at the same time according to Eric Bergen. [...]
