Lessons learned about partitioning – p0

My favorite MySQL book covers a lot of the interesting use cases for partitioning and some of the pitfalls. In the past few months a few more pitfalls have been discovered. First let me start with the common use cases for partitioning and why they exist.

Some data becomes less useful over time. Prior to built-in partitioning there wasn't a very efficient way to get rid of this data without deleting it one record at a time. It was common to cron up something like

delete from foo where time_created < unix_timestamp(now() - interval 30 day);

Depending on the rate at which data is inserted, the cron may need to run every few minutes so the delete queries don't clog up replication. This has a very high cost: it forces the database to do a bunch of work to mark records deleted, maintain indexes, and later purge the old data. Back in the day purge thread lag was a problem, and forcing it to clean up a bunch of deleted records made it worse.
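
A common refinement is to delete in small chunks so no single statement clogs replication. A sketch, reusing the hypothetical foo table above:

-- Run repeatedly until zero rows are affected; small chunks keep
-- each statement cheap to replicate.
delete from foo
 where time_created < unix_timestamp(now() - interval 30 day)
 limit 1000;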

Then came poor man's partitioning, which has been invented several times all around the world. The idea here is to create several specially named tables that together act as one logical table. Then tables can be dropped instead of records being deleted. This saves the purge thread workload but puts a high cost on the application developer, who has to write the logic to query the correct partitions and merge the results, plus the logic for which partition to insert into. While this usually made things faster, the developer overhead was a nightmare.
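
A minimal sketch of the pattern, with hypothetical table names:

-- One specially named table per day; the application routes writes.
CREATE TABLE logs_20140720 (ts INT UNSIGNED NOT NULL, msg VARCHAR(100)) ENGINE=InnoDB;
CREATE TABLE logs_20140721 (ts INT UNSIGNED NOT NULL, msg VARCHAR(100)) ENGINE=InnoDB;

-- Reads must query each relevant table and merge the results.
SELECT * FROM logs_20140720 WHERE ts >= 1405839600
UNION ALL
SELECT * FROM logs_20140721 WHERE ts >= 1405839600;

-- Expiring a day of data is a cheap drop instead of a slow delete.
DROP TABLE logs_20140713;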

Enter MySQL's built-in partitioning. It supports different partitioning functions, pruning, and simple syntax for adding and dropping partitions. It works well enough for the cases described in High Performance MySQL. By far the most common use case in my world is dropping old data.

Consider this very simple test case of a summary table for access logs. It is partitioned by day for the week prior to when this post was written.

CREATE TABLE `access_logs` (
`page` varchar(25) NOT NULL,
`ts` int(10) unsigned NOT NULL,
`page_views` int(10) unsigned NOT NULL,
PRIMARY KEY (`page`,`ts`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (ts)
(PARTITION p0 VALUES LESS THAN (1405839600) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (1405926000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (1406012400) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (1406098800) ENGINE = InnoDB,
PARTITION p4 VALUES LESS THAN (1406185200) ENGINE = InnoDB,
PARTITION p5 VALUES LESS THAN (1406271600) ENGINE = InnoDB,
PARTITION p6 VALUES LESS THAN (1406358000) ENGINE = InnoDB,
PARTITION p7 VALUES LESS THAN (1406444400) ENGINE = InnoDB) */

It is very common to create new partitions ahead of the current time and drop partitions as they age out. This table can effectively hold a week's worth of page views that can be accessed by page and timestamp, or by page alone. This is generally a more efficient way to expire old data than deleting rows, because dropping an old partition is basically the same thing as dropping a table. When using innodb_file_per_table and xfs this is pretty quick.
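
The nightly maintenance then amounts to something like this sketch (the new boundary is one day past the last partition above):

-- Add the next day's partition ahead of time and age out the oldest.
ALTER TABLE access_logs ADD PARTITION (PARTITION p8 VALUES LESS THAN (1406530800));
ALTER TABLE access_logs DROP PARTITION p0;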

For tables with a few partitions this works well enough. As the number of partitions increases, it can start to introduce inefficiencies into the system. In some cases partitions are treated like tables, which means an operation on the table performs the same operation on every underlying partition. If I want to get all of the logs for a page, MySQL has to look inside every partition to see if there are any logs for that page.

mysql> explain partitions select * from access_logs where page='foo';
+----+-------------+-------------+-------------------------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table       | partitions              | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------------+-------------------------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | access_logs | p0,p1,p2,p3,p4,p5,p6,p7 | ref  | PRIMARY       | PRIMARY | 27      | const |    8 | Using where |
+----+-------------+-------------+-------------------------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

As the number of partitions increases this can slow down significantly. If most of the partitions contain the page being searched for, this work would have been done anyway. If the page being searched for is in only a few partitions, the overhead of looking inside each partition can be significant. When the data set size exceeds the buffer pool, MySQL may end up doing several disk reads per partition only to discover that there are no relevant rows in that partition. If you extrapolate this out to one partition per day for an entire year it gets very costly.

In order for queries on this table to be efficient they must include a timestamp predicate so the optimizer can prune partitions. You may end up storing the timestamp of the first time a page was seen in another table to provide pruning hints for this one. If your table has a fairly large row size you may end up creating secondary indexes to make the dives into each partition smaller and faster.
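
For example, adding a ts range to the earlier query lets the optimizer prune to only the partitions that can contain matching rows:

-- With a range on the partitioning column, explain partitions should
-- list only the one or two relevant partitions instead of all eight.
explain partitions select * from access_logs
 where page = 'foo' and ts >= 1406358000 and ts < 1406444400;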

Remember earlier when I said that adding and dropping partitions is usually fast enough? It is fast enough when queries on the table are short lived. It is important to understand when a partitioned table is treated as a whole table and when partitions act like their own tables. When running alter table, MySQL takes a metadata lock on the entire table. This occurs even when only adding a partition. This means seemingly unrelated queries can interact with each other. I've inserted a bunch of fake data into the table.

mysql> select count(*) from access_logs;
+----------+
| count(*) |
+----------+
|     4096 |
+----------+
1 row in set (0.00 sec)

Now I am going to run a slow query against the first partition of the table. This can happen during backups with mysqldump or with any other plain select query that needs to read a lot of data. I am pinning this query to the first partition to make a point; the result is the same no matter which partition is read. The slow query reading from partition p0 will be:

select sleep(1) from access_logs partition(p0);

This simulates a slow query reading every row from partition p0. Now I want to add a partition to the end of the table, but the alter table will hang until lock_wait_timeout is exceeded, which defaults to a year. I have it set much lower:

mysql> alter table access_logs add partition (partition p8 values less than (1406530800));
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

So what happened? Taking a look at show processlist reveals that alter table was unable to run because the select query holds the metadata lock on the table.

mysql> select state, info from information_schema.processlist;
+---------------------------------+-------------------------------------------------------------------------------------+
| state                           | info                                                                                |
+---------------------------------+-------------------------------------------------------------------------------------+
| User sleep                      | select sleep(1) from access_logs partition(p0)                                      |
| Waiting for table metadata lock | alter table access_logs add partition (partition p8 values less than (1406530800)) |
| executing                       | select state, info from information_schema.processlist                             |
+---------------------------------+-------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

In MySQL 5.6 a single metadata lock on the table covers all alter table operations. This means that before a partition can be added to or dropped from a table, all other queries against it must finish executing so the alter table statement can acquire the metadata lock. The simple act of adding a partition can stall every query accessing a table for quite some time if lock_wait_timeout isn't set sensibly.
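
One way to limit the damage is to lower the timeout just for the session running the DDL, so a blocked alter gives up quickly instead of queuing every other query behind its lock request:

-- Give up after 60 seconds instead of the default year.
set session lock_wait_timeout = 60;
alter table access_logs add partition (partition p8 values less than (1406530800));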

I would really like to see better handling of locking for adding new partitions. I think the case can be made for acquiring an MDL when dropping a partition, but adding a new one should be seamless. Manual partition management is cumbersome at best. Having the ability to automatically drop and create new partitions without fighting with MDL or writing a bunch of scripts would be very nice.

 

I just wanted to draw a line in R.

I had a list of numbers. I wanted to see what they would look like in a graph. The list was a bit too long for Excel and I don't have gnuplot installed, so I thought about using R. I've used R plenty of times for making basic graphs from csv files with two columns. This is easy enough and there are dozens of tutorials. What is missing from the wide world of basic R tutorials is taking a simple list of numbers and drawing a line. Why would one do this? I wanted to get an idea of the outliers in my set of numbers and how the line looked after removing them.

So here it is. The thing I spent over an hour trying to figure out. How to take a list of numbers and make a line in R.

Start with a file called test.csv

ebergen$ cat test.csv
1
2
2
2
3
4
4
5

Load the file into a variable called foo and plot it (header=FALSE stops read.csv from treating the first number as a column name). Type this into R and press enter after each line:

foo <- read.csv('/Users/ebergen/tmp/test.csv', header=FALSE)
plot(1:nrow(foo), foo[[1]], type='l')

For bonus points save it as a png.

png('test.png')
plot(1:nrow(foo), foo[[1]], type='l')
dev.off()

The result with no labels, scales, or anything. It is just a line.

Look at it. Love it!

Special thanks to Jeff Sipek for giving me the answer after google failed me.

pt-online-schema-change and row based replication

The way online schema changes have historically worked with statement based replication is to create an empty table on a slave, set up triggers to capture changes to a log table, copy the old table to the new table, apply the changes from the log table, and atomically rename. This process breaks when using row based replication because triggers don't fire for row events replicated to the slave. Triggers fire on the master, and the row events for any modified tables just replicate to the slave. This makes traditional online schema change break with row based replication.
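
For reference, the trigger capture works roughly like this simplified sketch (hypothetical table and column names; real tools generate one trigger each for insert, update, and delete):

-- Build the new table, capture changes while copying, then swap.
CREATE TABLE foo_new LIKE foo;
ALTER TABLE foo_new ADD COLUMN extra INT;  -- the schema change
CREATE TRIGGER foo_osc_ins AFTER INSERT ON foo
FOR EACH ROW REPLACE INTO foo_new (id, val) VALUES (NEW.id, NEW.val);
-- ... copy foo into foo_new in chunks, then rename atomically:
RENAME TABLE foo TO foo_old, foo_new TO foo;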

The workaround I've seen a few times at the conference is to run schema changes on the master when using row based replication. This means either significantly reducing replication capacity during a schema change or running the change really slowly so replication can keep up. Neither of these is optimal. I would like to submit an alternate method. Rather than rely on triggers on the slave, simply run another binlog tailer and capture only the events related to the table being changed. Triggers and their DDL mess are out of the picture. Then you can apply changes the same way statement based online schema change would and rename the tables.

None of this is required with online ddl and innodb online index defragmentation.

Enjoy.

Select into outfile and load data infile are not complementary by default

Configuring the character set for MySQL is confusing. It is so confusing that there are roughly 25 different places to configure a character set. Don't believe me? Add them up. The real number may be closer to 30. I realize a lot of this is due to the age of MySQL and the extent of its character set support. MySQL does support character set configuration in many different places, which is usually a good thing.

I often complain about defaults that make no sense, like lock_wait_timeout=1 year. In this case there is a default that makes absolutely no sense to me. The manual says that select into outfile is the complement of load data infile. That isn't completely true. They differ in one key aspect: the default character set. By default select into outfile now does the right thing by using the binary character set and dumping the raw bytes to the file. Load data infile defaults to the value of the character_set_database variable, which defaults to latin1. Here is an example of how this goes wrong using the utf8 snowman ☃.

MariaDB [test]> set names utf8;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> show create table t;
+-------+--------------------------------------------------------------------------------+
| Table | Create Table                                                                   |
+-------+--------------------------------------------------------------------------------+
| t     | CREATE TABLE `t` (
  `t` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [test]> insert into t set t = unhex('e29883'); -- e29883 is the hex code for the utf8 snowman, a useful trick if you suspect double encoding or encoding errors in your terminal or wherever.
Query OK, 1 row affected (0.00 sec)

MariaDB [test]> select * from t;
+------+
| t    |
+------+
| ☃    |
+------+
1 row in set (0.00 sec)

MariaDB [test]> select t from t into outfile '/var/tmp/snowman';
Query OK, 1 row affected (0.00 sec)

MariaDB [test]> load data infile '/var/tmp/snowman' into table t;
Query OK, 1 row affected (0.03 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

MariaDB [test]> select * from t;
+--------+
| t      |
+--------+
| ☃      |
| â˜ƒ    |
+--------+
2 rows in set (0.00 sec)

MariaDB [test]> show variables like 'character_set_database';
+------------------------+--------+
| Variable_name          | Value  |
+------------------------+--------+
| character_set_database | latin1 |
+------------------------+--------+
1 row in set (0.00 sec)

MariaDB [test]> set character_set_database=utf8;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> load data infile '/var/tmp/snowman' into table t;
Query OK, 1 row affected (0.03 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

MariaDB [test]> select * from t;
+--------+
| t      |
+--------+
| ☃      |
| â˜ƒ    |
| ☃      |
+--------+
3 rows in set (0.00 sec)

While this technically works, I prefer to just set the character set on both statements to binary and get the same bytes in and out. The value for the middle snowman is actually just double encoded, so we can get the original bytes back by converting the character set to latin1.

MariaDB [test]> select * from t;
+--------+
| t      |
+--------+
| ☃      |
| â˜ƒ    |
| ☃      |
+--------+
3 rows in set (0.00 sec)

MariaDB [test]> set names latin1;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> select * from t;
+------+
| t    |
+------+
| ?    |
| ☃    |
| ?    |
+------+
3 rows in set (0.00 sec)

This gets back the double encoded snowman, but the other two snowmen are now a '?', which is what MySQL substitutes when a character can't be converted to the target character set. If you see a question mark or � it is a good bet that some string of bytes failed to be converted and was lost.
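
As an aside, the usual repair for a double encoded value is to convert it back down to latin1, cast through binary, and reinterpret the bytes as utf8. A sketch, shown as a select; only apply it to rows that really are double encoded:

-- Undo one layer of double encoding: utf8 -> latin1 bytes -> utf8.
select convert(binary convert(t using latin1) using utf8) from t;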

MariaDB [test]> truncate table t;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> show variables like '%character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

MariaDB [test]> insert into t set t = unhex('e29883');
Query OK, 1 row affected (0.03 sec)

MariaDB [test]> select t from t into outfile '/var/tmp/snowman' character set binary;
Query OK, 1 row affected (0.00 sec)

MariaDB [test]> load data infile '/var/tmp/snowman' into table t character set binary;
Query OK, 1 row affected (0.03 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

MariaDB [test]> select t, hex(t) from t;
+------+--------+
| t    | hex(t) |
+------+--------+
| ☃    | E29883 |
| ☃    | E29883 |
+------+--------+
2 rows in set (0.00 sec)

Perfect! I’m not sure why binary isn’t the default for both. I really don’t understand why a statement that works on a file and a table uses character_set_database as the default. I know this is documented in the manual but it isn’t self documenting or obvious.

How to get from MySQL SQL to C

Occasionally it is useful to know what a MySQL command is doing internally. Just looking into the MySQL source directory can be overwhelming. Knowing the basics of the handler interface and the sql parser can be a great start for reading the source code to understand what MySQL does under the hood. Here I will cover a little bit about how the SQL syntax is defined.

Everything starts with lex.h and sql_yacc.yy in the sql/ dir. lex.h contains all the functions and symbols used to make up the SQL syntax. The sql_yacc.yy file describes the relationships between these symbols and the C functions responsible for executing them. I’m not sure why some symbol definitions end in _SYM and others don’t. Looking in lex.h “FLUSH” is defined as FLUSH_SYM. To see all the places where flush is allowed in the SQL go back to sql_yacc.yy and grep for it.

The first important section looks like this:

/* flush things */

flush:
          FLUSH_SYM opt_no_write_to_binlog
          {
            LEX *lex=Lex;
            lex->sql_command= SQLCOM_FLUSH;
            lex->type= 0;
            lex->no_write_to_binlog= $2;
          }
          flush_options
          {}
        ;

This is where things can get a bit nested and weird. The flush: rule says that FLUSH_SYM can optionally be followed by opt_no_write_to_binlog. The first section in curly braces sets the SQL command and the no_write_to_binlog flag. SQLCOM_FLUSH is important in the next phase, where we get into actual C code.

flush_options is used to define all of the possible options for a flush command. Going one step further down, flush_options_list basically says that a flush command can contain more than one option.

flush_options_list:
          flush_options_list ',' flush_option
        | flush_option
          {}
        ;

Notice that flush_options_list can contain a flush_options_list. This left recursion is the yacc way of saying things can be repeated. In this case the | says that there can be multiple flush_option entries separated by commas, or just one option.

With flush_option: things start to make more sense. This is all of the different types of flush commands. Looking at the first part:

flush_option:
          ERROR_SYM LOGS_SYM
          { Lex->type|= REFRESH_ERROR_LOG; }
        | ENGINE_SYM LOGS_SYM
          { Lex->type|= REFRESH_ENGINE_LOG; }
        | GENERAL LOGS_SYM
          { Lex->type|= REFRESH_GENERAL_LOG; }
        | SLOW LOGS_SYM

Reading it in English, this is basically saying "error logs OR engine logs OR general logs OR slow logs". Combining this with the previous rule allowing multiple flush options, this is a valid query:

MariaDB [test]> flush error logs, slow logs;
Query OK, 0 rows affected (0.00 sec)

The flush command is quite a bit improved in MariaDB 10. Comparing this to part of the flush_option: section from MariaDB 5.2 shows how much it has improved:

flush_option:
          table_or_tables
          { Lex->type|= REFRESH_TABLES; }
          opt_table_list {}
        | TABLES WITH READ_SYM LOCK_SYM
          { Lex->type|= REFRESH_TABLES | REFRESH_READ_LOCK; }
        | QUERY_SYM CACHE_SYM
          { Lex->type|= REFRESH_QUERY_CACHE_FREE; }
        | HOSTS_SYM
          { Lex->type|= REFRESH_HOSTS; }
        | PRIVILEGES
          { Lex->type|= REFRESH_GRANT; }
        | LOGS_SYM
          { Lex->type|= REFRESH_LOG; }
        | STATUS_SYM
          { Lex->type|= REFRESH_STATUS; }

In 5.2 LOGS_SYM stands alone, which makes flush error logs an invalid query. By scanning the grammar in sql_yacc.yy it is easy to see which syntax is and isn't supported between versions.

Now that the command has been set to SQLCOM_FLUSH, the flush options are passed in via Lex->type; each option is bitwise ORed into type. It is time to switch over to the C++ code and see how these are executed.

sql_parse.cc has a huge switch statement that covers every possible command type MySQL can process. For this example look for case SQLCOM_FLUSH:. SQLCOM_FLUSH is the same token from the grammar file. There is basically one important function in this section: reload_acl_and_cache().

In newer versions reload_acl_and_cache() is in sql_reload.cc; in older versions it is in sql_parse.cc. This function checks each type of thing that can be flushed, one by one, to see if its flag has been ORed into Lex->type (called options here). It then calls the C++ code responsible for carrying out the flushing of that type of object.

Other types of calls can be traced the same way. Most of the token names can easily be grepped from sql_yacc.yy. Most of the show commands are handled similarly to flush, except once they get into C++ code they are mapped through a special pair of arrays that tell MySQL how to generate an INFORMATION_SCHEMA table.

For the show commands the C++ code is in the sql_yacc.yy file as a call to prepare_schema_table(). There are two arrays that are important when adding a new table. In MariaDB 10 the first is enum_schema_tables in handler.cc. In older versions this array is in table.h. The other array is ST_SCHEMA_TABLE schema_tables in sql_show.cc.

schema_tables has the important information. Among other things it holds the text name of the table, the fields used to make it, and the function called to generate the data in the table. The fields list usually ends with _fields_info and the function used to create the result set used for the table starts with fill_.
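
The mapping is easy to see from the client side. For example, these two should return the same data (assuming the usual pairing; exact column sets vary by version):

-- A show command and the INFORMATION_SCHEMA table behind it.
SHOW CHARACTER SET;
SELECT * FROM INFORMATION_SCHEMA.CHARACTER_SETS;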

Percona Live Conference Notes

This is the required post about things I observed during this year's MySQL conference.

Things that are awesome:

  • The tables in sessions. I think these were here last year. They are still awesome this year.
  • The new style power plugs. They solved the problem of people tripping over daisy chained power strips and the strips being accidentally turned off.
  • Massive quantities of coffee and real cream.

Things that can be improved:

  • Lunch tickets. I overheard the same conversation a dozen times about people not being able to find their lunch tickets or not really knowing about them.
  • Make badges reversible. A badge under observation will be facing the wrong way.

Things that just bummed me:

  • The music is different this year. Now it makes me feel like a teenager struggling with a breakup.
  • My secret clean bathroom has been discovered and soiled.
  • The bathrooms by the restaurant were all turned around. I had great fun watching conference veterans try to walk into the wrong bathroom.

 

My new favorite example of why it isn’t a good idea to use reserved words as column names.

Some show commands support a where clause. The column name that can be used in the expression for the where clause depends on the result of the show command. For example in show tables the column is Tables_in_foo where foo is the database name.

MariaDB [test]> show tables where Tables_in_test = 't';
+----------------+
| Tables_in_test |
+----------------+
| t              |
+----------------+
1 row in set (0.00 sec)

This is a problem with the show databases command because databases aren’t really *in* anything. Database is a reserved word so this happens.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| test               |
+--------------------+
3 rows in set (0.00 sec)

MariaDB [(none)]> show databases where database='test';
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '='test'' at line 1
MariaDB [(none)]> show databases where `database`='test';
+----------+
| Database |
+----------+
| test     |
+----------+
1 row in set (0.00 sec)

I can’t decide if this is a bug or the greatest example of why choosing descriptive non-reserved word column names is very important.

Shutting down with mysqld, mysqladmin, SIGTERM, or SIGKILL.

How do you shutdown mysqld? I tend to use SIGTERM. People where I work tend to use mysqladmin shutdown. When things get bad I use SIGKILL. These three methods will end up with a dead mysqld but the one you choose depends on the situation and can even result in lost data. On Linux the difference between SIGTERM and SIGKILL is significant and often times misunderstood.

Before processes start to die it is important to understand the relationship between mysqld_safe and mysqld. mysqld_safe is the watchdog script for mysqld. It is responsible for starting mysqld and keeping an eye on it. It does this by waiting for mysqld to exit and then checking the return code. On a safe shutdown, such as one done by mysqladmin or a SIGTERM, mysqld will return zero. When mysqld_safe sees a zero return code it will also exit. If the return code is anything else then mysqld_safe assumes mysqld crashed and starts a new instance of mysqld. This difference can help explain why mysqld sometimes just won't go away.

The basic shutdown process is fairly well documented so I won't cover that here. It is important to understand the difference between mysqladmin and SIGTERM. As described in the manual, mysqladmin will create a shutdown thread. Using a TERM signal will also create a shutdown thread. The major difference between the two is that mysqladmin makes a connection to mysql and sends a shutdown packet. This means it can be used from a remote host. It also means that mysqladmin must pass MySQL permission checks before being allowed to shut down the server. SIGTERM is delivered through the signal mechanism in Linux and must only pass the Linux rules for which users may deliver a signal. For example, if mysqld is running as root then the ebergen user can't deliver a signal to it without using sudo.

SIGTERM is handy for safely shutting down mysqld when the root password is lost or there are too many connections and the reserved super user connection is also taken up. I tend to use this method the most because I'm usually logged into the server via ssh and SIGTERM usually works. It is also slightly less typing. I do have a bad habit of using killall mysqld instead of kill -TERM with a specific pid. killall delivers a TERM signal to every mysqld on the system, so if you happen to be on a machine that has multiple mysqld instances, such as a shared developer workstation, don't do that. There are several different methods to find the mysqld process and send it a TERM signal. Here are a few examples, where nnnn is the pid of the mysqld process. These also work with SIGKILL. Note that killall defaults to a TERM signal.

  • kill -TERM nnnn
  • killall mysqld
  • kill -TERM `pidof mysqld`
  • kill -TERM `cat /var/run/mysqld.pid`

Getting into SIGKILL, or kill -KILL, is where things get interesting. In Linux SIGKILL isn't really a signal in that it never actually gets delivered to the process. Since it never gets delivered to the process it can't be caught or blocked. This is where people usually tell me that SIGKILL can be blocked. My best guess on why this confusion happens is that the concept of a signal that can't be blocked seems to defeat the purpose of a signal. Also, I don't think a lot of code checks for errors from the signal function call. Lastly, there may be other operating systems that allow a SIGKILL to be blocked. Here is my proof that it can't be blocked in Linux.

ebergen@randy:(~) cat test.c
#include <unistd.h>
#include <signal.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Try to ignore SIGKILL; signal() refuses and returns SIG_ERR. */
    if (signal(SIGKILL, SIG_IGN) == SIG_ERR)
        printf("Oops!!\n");
    else
        printf("Good to go!\n");
    sleep(30);
    return 0;
}
ebergen@randy:(~) gcc -Wall test.c
ebergen@randy:(~) strace ./a.out
munmap(0x2b90d5119000, 104917) = 0
rt_sigaction(SIGKILL, {0x1, [KILL], SA_RESTORER|SA_RESTART, 0x3fe00302d0}, {0x3fdfe1cc80, ~[HUP INT QUIT ILL ABRT KILL USR1 SEGV PIPE TERM STKFLT TSTP TTIN URG XCPU VTALRM WINCH IO RT_16 RT_17 RT_18 RT_19 RT_20 RT_21 RT_22 RT_23 RT_24 RT_25 RT_26 RT_27 RT_28 RT_29 RT_30 RT_31], SA_RESTORER|SA_INTERRUPT|SA_NODEFER|SA_RESETHAND|0x1272740, 0x1bc6ade2}, 8) = -1 EINVAL (Invalid argument)
write(1, "Oops!!\n", 7Oops!!
) = 7

nanosleep({30, 0}, <unfinished ...>
ebergen@randy:(~) ./a.out
Oops!!

What SIGKILL means is to remove a process from existence. The process never gets another chance to run, so it can't attempt to block the signal. Linux simply cleans it up. What this means for MySQL is that it doesn't get a chance to perform any shutdown tasks like flushing indexes for MyISAM tables. To MySQL it is effectively the same as pulling the plug on the server, except that the filesystem still has a chance to flush modified data to disk.

SIGKILL should really be a last resort or only used when you know your mysqld is safe to shutdown. There is plenty of discussion on this with respect to making consistent backups. The rules are basically the same. When you take a filesystem snapshot or lvm snapshot the way the snapshot looks is effectively the same as running SIGKILL on mysqld to remove it from existence.

Percona Live MySQL Conference

These are things I like that I consider differences from last year's conference. There are plenty of other things I like that don't need to be listed here.

  • The overall tone and feel of the conference was much less marketing and much more technical
  • Refreshingly honest keynotes. There was a lot of coming clean about the history of MySQL and the conference.
  • Percona is very technical but it is also a business. They are very good about bringing out the technical and not being pushy about the business.
  • No ice cream social. A thousand people shaking sticky hands with each other is never a good idea.
  • percona.tv
  • The conference was busy but never crowded

Now for the dislike:

  • Only one song before every session.
  • The chairs. Damn the chairs.
  • Wifi failed more often than it worked. Most of the time I was tethered to my phone.
  • smartcity doesn't work with an ssh port forwarded dns/socks poor man's vpn.

Just some things to note and tips for next year

  • The classic rock bump-in music for the keynotes didn't really fit.
  • Percona folks are usually really good about having information in the slides. So if you have to choose between a Percona talk and some other talk, skip the Percona one because you can go back and read the slides.
  • There is a secret clean bathroom that usually stays clean until the last day. I’m not saying where this is.
  • Monty’s BoF always has black vodka.
  • The black vodka is dangerous so be careful.

I was tempted to skip the conference this year because last year was so depressing. I decided to give Percona a chance at hosting it and I’m glad I did. I look forward to attending and maybe presenting next year.

MySQL 5.0 can deadlock when flush logs, show processlist, and a slave connection happen at the same time

[ Note: I haven't fully investigated this problem but the post has been hanging around in my queue for months. Rather than have it rot there I am publishing what I know in hopes that it helps someone else. ]

There are a lot of different kinds of locks in MySQL. Some of these locks are exposed to users, such as intention locks, table locks, and row locks. There are other locks that aren't exposed as well. These are mutexes that MySQL uses internally to protect resources from being modified by more than one thread at a time. These locks are numerous and complicated. When they deadlock, MySQL can stop dead in its tracks. The last deadlock I found happens when flush logs, show processlist, and a slave reconnect happen at the same time. I don't have a core from the mysqld process, only stack traces. The breakdowns accompanying the stack traces list the locks I'm pretty sure each thread owns and the ones it may be stuck trying to acquire. I am working on gathering more data and making a repeatable test case. I will update this post or link to a new one when I do.

Flush logs is responsible for rotating and deleting old logs. It rotates every log under the sun, including the relay and binary logs. I created a patch a few years ago to allow users to specify which log to rotate; it is now included in MySQL 5.5. Flush logs has a safeguard built in when flushing binary logs to make sure that it isn't going to delete a log that a binlog dump thread hasn't sent to a slave yet. For this safeguard to work it has to loop through every client thread in MySQL and check whether it is a dump thread. If it is, then it has to check which log the dump thread is on to make sure the log isn't in use.
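
With a version that includes the patch, the rotation script can flush only what it needs instead of every log on the server:

-- Rotate just the binary logs (MySQL 5.5 and later); relay logs can
-- be flushed separately the same way.
FLUSH BINARY LOGS;
FLUSH RELAY LOGS;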

Each log file has a few mutexes associated with it. The flush logs thread owns LOCK_log and LOCK_index from the binary log. bl in this case refers to the binary log object. linfo is the log info object which is part of the thread for some reason.

Owns locks:
  bl->LOCK_log
  bl->LOCK_index
Tries to lock:
  LOCK_thread_count
  linfo->lock

Thread 11 (Thread 0x458e6940 (LWP 27068)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003775608e50 in _L_lock_1233 () from /lib64/libpthread.so.0
#2 0x0000003775608dd3 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000066f8f6 in log_in_use(char const*) () <-- LOCK_thread_count
#4 0x00000000005f88d0 in MYSQL_LOG::purge_logs_before_date(long) () <-- gets bl->LOCK_index
#5 0x00000000005fa09c in MYSQL_LOG::rotate_and_purge(unsigned int) () <-- gets bl->LOCK_log
#6 0x000000000058715d in reload_acl_and_cache(THD*, unsigned long, TABLE_LIST*, bool*) ()
#7 0x000000000058c3d3 in mysql_execute_command(THD*) ()
#8 0x00000000005915f1 in mysql_parse(THD*, char const*, unsigned int, char const**) ()
#9 0x000000000059279e in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#10 0x000000000059374b in handle_one_connection ()
#11 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#12 0x0000003774ad3d1d in clone () from /lib64/libc.so.6

 

What I mean by slave connection is what happens on the master when a slave connects. The master will spawn a thread that is responsible for killing any other threads with the same server id and then sending binlogs to the slave. This is why two slaves configured with the same server id will disconnect each other. Slaves will try to reconnect a minute later (60 seconds is the default retry interval), so every minute each slave gets to download binlogs for a minute. The important part for this deadlock is that when a new slave connects, the thread spawned by the master will walk through every thread connected to MySQL looking for an old thread with the same slave server id and try to kill it.
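
On 5.1 and later the dump threads are easy to observe from the client (a sketch; on a master their command column shows Binlog Dump):

-- Find the binlog dump threads currently serving slaves.
SELECT id, user, host, state
  FROM information_schema.processlist
 WHERE command = 'Binlog Dump';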

This is the stack from the new slave thread trying to kill the old one:

Owns:
  tmp->LOCK_delete (from thread 16)
  thread 16's mysys_var->mutex
Tries:
  bl->LOCK_log (via thread 16's mysys_var->current_mutex)

Thread 17 (Thread 0x45968940 (LWP 27194)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003775608e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003775608cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000056863f in THD::awake(THD::killed_state) ()
#4 0x000000000066f8a3 in kill_zombie_dump_threads(unsigned int) ()
#5 0x0000000000592a6e in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#6 0x000000000059374b in handle_one_connection ()
#7 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#8 0x0000003774ad3d1d in clone () from /lib64/libc.so.6

The current_mutex deserves a bit of explanation here. In MySQL, when a thread wants to wait on a condition it copies the mutex and condition into mysys_var->current_mutex and mysys_var->current_cond. The mysys_var->mutex is the mutex that protects the mysys_var struct. This way, if any other thread needs to kill it, it knows the condition and mutex the thread is waiting on so it can be woken up. This is an easy mechanism for killing threads, but it makes it more difficult to figure out what a thread is locking on. In this case I think current_mutex is bl->LOCK_log from thread 16, which is coming up.

Thread 16 (Thread 0x45aee940 (LWP 3819)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000037756101b1 in _L_cond_lock_989 () from /lib64/libpthread.so.0
#2 0x000000377561007f in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#3 0x000000377560af84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#4 0x00000000005f6637 in MYSQL_LOG::wait_for_update(THD*, bool) ()
#5 0x000000000067133a in mysql_binlog_send(THD*, char*, unsigned long long, unsigned short) ()
#6 0x0000000000592639 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#7 0x000000000059374b in handle_one_connection ()
#8 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#9 0x0000003774ad3d1d in clone () from /lib64/libc.so.6

Thread 16 is the old master thread waiting to dump a binlog. It should be killed by the new thread but isn't.

Show processlist also loops through the list of client threads to get the id, hostname, user, state, and query each one may be running. The common theme between all of these threads is that they all need to loop through the list of client threads in order to do their jobs. When looping through this list they must grab the LOCK_thread_count mutex. Owning this mutex means that no other thread can change the list of client threads, making it safe to loop through.

Owns:
  LOCK_thread_count
Tries:
  mysys_var->mutex (probably from thread 16/17)

Thread 3 (Thread 0x45a6c940 (LWP 13482)):
#0 0x000000377560d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003775608e50 in _L_lock_1233 () from /lib64/libpthread.so.0
#2 0x0000003775608dd3 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000000000065879a in mysqld_list_processes(THD*, char const*, bool) ()
#4 0x000000000058ecf6 in mysql_execute_command(THD*) ()
#5 0x00000000005915f1 in mysql_parse(THD*, char const*, unsigned int, char const**) ()
#6 0x000000000059279e in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#7 0x000000000059374b in handle_one_connection ()
#8 0x000000377560673d in start_thread () from /lib64/libpthread.so.0
#9 0x0000003774ad3d1d in clone () from /lib64/libc.so.6

I haven't quite solved this one yet, but I hope the breakdown may help someone else with the same issue. I worked around it by disabling a script that runs show processlist and records the output. This makes it far less likely that all three conditions will be met to produce the deadlock. Other options are changing the log rotation script to only flush the logs it needs or figuring out why the slave reconnect happens. I have a few theories about long running log rotations causing this but I haven't been able to reproduce them.