InnoDB statistics analyze table and long running queries

(if links are broken let me know)

While the most basic table size tracking can be done by looking at the file on disk it is also useful to track the sizes included in InnoDB’s persistent statistics. Keeping these sizes updated without impacting production traffic can be challenging. This is my history with InnoDB statistics over the past few years and how they work today.

Lets take a simple table that has a user id, sequence, and some data.

CREATE TABLE `t` (
`user` int(10) unsigned NOT NULL,
`sequence` int(10) unsigned NOT NULL,
`data` varchar(100) DEFAULT NULL,
PRIMARY KEY (`user`,`sequence`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The size on disk is 96K

-rw-rw—-. 1 mysql mysql 65 Jun 4 12:39 db.opt
-rw-rw—-. 1 mysql mysql 1.1K Jun 4 12:41 t.frm
-rw-rw—-. 1 mysql mysql 96K Jun 4 21:08 t.ibd

This is the basic size. The size from InnoDB comes from show table status. This is read from the persistent statistics.

mysql> show table status like ‘t’\G
*************************** 1. row ***************************
Name: t
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 58
Avg_row_length: 282
Data_length: 16384
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: 2016-07-16 16:01:50
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

Persistent statistics are great because they keep the stats the same across restarts. The recalculation can be triggered with either analyze table or by modifying more than 10% of the rows in the table. The number of rows modified is not persisted. It is an in memory counter that is reset to zero whenever the server restarts. This is the main problem with persistent statistics is that frequent upgrades can cause the calculation to become stale. On large instances with tables that have a slow mutation rate the stats may be the same for months or never change at all.

In stock MySQL the way to fix this is by running analyze table. Analyze table is a disaster waiting to happen on workloads with mixed short and long running queries. Imagine a system with many small queries and one long query from something like mysqldump. Analyze table forces the internal table version to be incremented. The table t has 20 rows. This is what happens with a mix of slow and long running queries.

Session one is a fake backup query “select sleep(10) from t;” that will sleep for 10 seconds for each of the 20 rows in the table. Session 2 is going to run “analyze table” then “select t from t limit 1”. The second session running analyze gets this:

mysql> analyze table t;

+—————–+———+———-+———-+
| Table | Op | Msg_type | Msg_text |
+—————–+———+———-+———-+
| test.t | analyze | status | OK |
+—————–+———+———-+———-+
1 row in set (0.00 sec)

mysql> select user, sequence from t where user=1;

*hang*

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction: Timeout on table closing: test.t

The processlist output will look like this:

mysql> select time, state, info from information_schema.processlist where time > 0;
+——+————————-+————————-+
| time | state | info |
+——+————————-+————————-+
| 96 | User sleep | select sleep(10) from t |
| 2 | Waiting for table flush | select t from t limit 1 |
+——+————————-+————————-+
2 rows in set (0.01 sec)

A query that would normally have been very fast because it is reading one row is now stuck behind the long running backup query. What is worse is that the operator who ran the analyze query may not know that anything is going wrong on the database because analyze table does not have to wait for the metadata/table lock. Imagine this scenario on a server running thousands of queries per second. It piles up on max connections until the long running query is killed.

This happens because analyze table bumps the internal version for a table and sets a special flag on it. The table can’t be used by any new queries until all existing queries have closed it. This ensures that new queries will get the new table version info but may have disastrous results in an environment with mixed queries.

So analyze table is out of the picture, now what?

What we really need is the statistics updated but without all the rigid correctness for incoming queries and the hassle of the server going down. When InnoDB does statistics calculation based on row change rate it doesn’t bump the table version. So we can in theory do this without dealing with analyze table at all.

All of the table versioning reopen magic happens in the upper layer of mysql. InnoDB uses the same info_low() function to calculate and return statistics for the natural update and when called by analyze. We just need a way to call this function without the overhead.

MySQL has a way to do stats calculation on metadata operations such as show table status and selecting from information_schema.tables. With innodb_stats_on_metadata is set the stats calculation will be performed whenever these operations are done. That is great but it really slows down metadata operations that don’t care about the most recent statistics.

We needed a way to force statistics calculation only on the session level without all the mess of analyze table. What we did was make innodb_stats_on_metadata a session variable so we can pick which clients force statistics update and which ones don’t.

After filling up the test table t with some variable length rows using this query a bunch of times:

insert into t select 1, @a:=@a+1, repeat(‘foo’, rand() * 10) from t;

The table statistics look like this with a few hundred thousand rows.

mysql> show table status like ‘t’\G
*************************** 1. row ***************************
Name: t
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 212383
Avg_row_length: 51
Data_length: 11026432
Max_data_length: 0
Index_length: 0
Data_free: 4194304
Auto_increment: NULL
Create_time: 2016-07-27 14:01:52
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)

With innodb_stats_on_metadata set for the session the row estimate changes because stats calculation relies on diving into random pages in the table and counting how many records are in them. These are variable length records so the calculation will change with each update.

mysql> set innodb_stats_on_metadata=1; select table_rows from information_schema.tables where table_name=’t’;
Query OK, 0 rows affected (0.00 sec)

+————+
| table_rows |
+————+
| 212597 |
+————+
1 row in set (0.00 sec)

mysql> set innodb_stats_on_metadata=1; select table_rows from information_schema.tables where table_name=’t’;
Query OK, 0 rows affected (0.00 sec)

+————+
| table_rows |
+————+
| 212475 |
+————+
1 row in set (0.00 sec)

mysql> set innodb_stats_on_metadata=0; select table_rows from information_schema.tables where table_name=’t’;
Query OK, 0 rows affected (0.00 sec)

+————+
| table_rows |
+————+
| 212475 |
+————+
1 row in set (0.00 sec)

mysql> set innodb_stats_on_metadata=0; select table_rows from information_schema.tables where table_name=’t’;
Query OK, 0 rows affected (0.00 sec)

+————+
| table_rows |
+————+
| 212475 |
+————+
1 row in set (0.00 sec)

 

The first two force calculation and the last two use the cached values. Rerunning the previous test we can see that this new calculation works without locking anything up.

mysql> select time, state, info from information_schema.processlist where time > 0;
+——+————+————————-+
| time | state | info |
+——+————+————————-+
| 22 | User sleep | select sleep(10) from t |
| 3 | | NULL |
+——+————+————————-+
2 rows in set (0.00 sec)

The INDEX_LENGTH, DATA_LENGTH, and TABLE_ROWS are all part of the same statistics update as is all the cardinality information. Updating these can be quite slow because by default InnoDB will read 10 random pages per table per index. Running show table status on a database with thousands of indexes on a disk system can take minutes or hours. With this model of choosing which connections update statistics we can walk through them, keep everything refreshed and prevent this performance penalty being paid by user facing queries.

The patch to make innodb_stats_on_metadata available as a session variable is on github here