Select into outfile and load data infile are not complementary by default

Configuring the character set for MySQL is confusing. It is so confusing that there are roughly 25 different places to configure a character set. Don’t believe me? Add them up. The real number may be closer to 30. I realize a lot of this is due to the age of MySQL and the extent of it’s character set support. MySQL does support character set configuration in many different places which is usually a good thing.

I often complain about defaults that make no sense like lock_wait_timeout=1 year. In this case there is a default that makes absolutely no sense to me. The manual says that select into outfile is the complement of load data infile. It isn’t completely true. They differ in one key aspect, the default character set!. By default select into outfile now does the right thing by using binary character set and dumping the raw bytes to the file. Load data infile defaults to the value of the character_set_database variable which defaults to latin1. Here is an example of how this goes wrong using utf8 snowman ☃.

MariaDB [test]> set names utf8;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> show create table t;
+——-+—————————————————————————————-+
| Table | Create Table |
+——-+—————————————————————————————-+
| t | CREATE TABLE `t` (
`t` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+——-+—————————————————————————————-+
1 row in set (0.00 sec)

MariaDB [test]> insert into t set t =unhex(‘e29883′); — e29883 is the hexcode for utf8 snowman, a useful trick if you suspect double encoding or encoding errors in your terminal or wherever.
Query OK, 1 row affected (0.00 sec)

MariaDB [test]> select * from t;
+——+
| t |
+——+
| ☃ |
+——+
1 row in set (0.00 sec)

MariaDB [test]> select t from t into outfile ‘/var/tmp/snowman';
Query OK, 1 row affected (0.00 sec)

MariaDB [test]> load data infile ‘/var/tmp/snowman’ into table t;
Query OK, 1 row affected (0.03 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

MariaDB [test]> select * from t;
+——–+
| t |
+——–+
| ☃ |
| ☃ |
+——–+
2 rows in set (0.00 sec)

MariaDB [test]> show variables like ‘character_set_database';
+————————+——–+
| Variable_name | Value |
+————————+——–+
| character_set_database | latin1 |
+————————+——–+
1 row in set (0.00 sec)

MariaDB [test]> set character_set_database=utf8;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> load data infile ‘/var/tmp/snowman’ into table t;
Query OK, 1 row affected (0.03 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

MariaDB [test]> select * from t;
+——–+
| t |
+——–+
| ☃ |
| ☃ |
| ☃ |
+——–+
3 rows in set (0.00 sec)

While this technically works I prefer to just set character set on both statements to binary and get the same bytes in and out. The value for the middle snowman is actually just double encoded so we can get the original bytes back by converting the character set to latin1.

MariaDB [test]> select * from t;
+——–+
| t |
+——–+
| ☃ |
| ☃ |
| ☃ |
+——–+
3 rows in set (0.00 sec)

MariaDB [test]> set names latin1;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> select * from t;
+——+
| t |
+——+
| ? |
| ☃ |
| ? |
+——+
3 rows in set (0.00 sec)

This gets back the double encoded snowmen but the other two snowman are a ‘?’ which is a utf-8 replacement character. If you see question mark or � then it is a good bet that some string of bytes failed to be encoded and was lost.

MariaDB [test]> truncate table t;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> show variables like ‘%character%';
+————————–+—————————-+
| Variable_name | Value |
+————————–+—————————-+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+————————–+—————————-+
8 rows in set (0.00 sec)

MariaDB [test]> insert into t set t =unhex(‘e29883′);
Query OK, 1 row affected (0.03 sec)

MariaDB [test]> select t from t into outfile ‘/var/tmp/snowman’ character set binary;
Query OK, 1 row affected (0.00 sec)

MariaDB [test]> load data infile ‘/var/tmp/snowman’ into table t character set binary;
Query OK, 1 row affected (0.03 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

MariaDB [test]> select t, hex(t) from t;
+——+——–+
| t | hex(t) |
+——+——–+
| ☃ | E29883 |
| ☃ | E29883 |
+——+——–+
2 rows in set (0.00 sec)

Perfect! I’m not sure why binary isn’t the default for both. I really don’t understand why a statement that works on a file and a table uses character_set_database as the default. I know this is documented in the manual but it isn’t self documenting or obvious.

Leave a Reply