Is group_concat_max_len in bytes or characters?

The manual says bytes, but sometimes it behaves as if it were measured in characters. It appears that group_concat_max_len is enforced in bytes when the result is passed through a temporary table and in characters otherwise. This works out fine with latin1, but when converting to utf8 MySQL must reserve 3 bytes per character when sizing columns in the temporary table. This is yet another reason to dislike group_concat.
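The arithmetic behind this can be sketched as follows (an illustration of the byte budget, not MySQL's actual sizing code): a temp-table column sized at group_concat_max_len bytes holds far fewer characters once each character can take mbmaxlen bytes.

```python
# Rough sketch of the byte-vs-character budget (illustrative only,
# not MySQL's actual sizing code).
def chars_that_fit(max_len_bytes: int, mbmaxlen: int) -> int:
    # A temp-table column sized at max_len_bytes can hold at most
    # this many characters of a charset with mbmaxlen bytes per char.
    return max_len_bytes // mbmaxlen

# latin1: 1 byte per character -> all 5 characters survive.
print(chars_that_fit(5, 1))  # 5
# utf8: MySQL reserves 3 bytes per character -> only 1 character fits,
# matching the single "a" returned by the temp-table query in the
# transcript below.
print(chars_that_fit(5, 3))  # 1
```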

mysql> create table group_concat_bug (str1 varchar(255), str2 varchar(255), str3 varchar(255)) charset=utf8 engine=innodb;
Query OK, 0 rows affected (0.02 sec)

mysql> show create table group_concat_bug;
+------------------+----------------------------------------------------+
| Table            | Create Table                                       |
+------------------+----------------------------------------------------+
| group_concat_bug | CREATE TABLE `group_concat_bug` (
  `str1` varchar(255) default NULL,
  `str2` varchar(255) default NULL,
  `str3` varchar(255) default NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+------------------+----------------------------------------------------+
1 row in set (0.00 sec)

mysql> insert into group_concat_bug set str1='a';
Query OK, 1 row affected (0.01 sec)

mysql> insert into group_concat_bug set str1='a';
Query OK, 1 row affected (0.00 sec)

mysql> insert into group_concat_bug set str1='a';
Query OK, 1 row affected (0.00 sec)

mysql> insert into group_concat_bug set str1='a';
Query OK, 1 row affected (0.00 sec)

mysql> insert into group_concat_bug set str1='a';
Query OK, 1 row affected (0.00 sec)

mysql> insert into group_concat_bug set str1='a';
Query OK, 1 row affected (0.00 sec)

mysql> select group_concat(str1) from group_concat_bug group by str2;
+--------------------+
| group_concat(str1) |
+--------------------+
| a,a,a,a,a,a        |
+--------------------+
1 row in set (0.00 sec)

mysql> set group_concat_max_len=5;
Query OK, 0 rows affected (0.00 sec)

mysql> explain select group_concat(str1) from group_concat_bug group by str2;
+----+-------------+------------------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table            | type | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+-------------+------------------+------+---------------+------+---------+------+------+----------------+
|  1 | SIMPLE      | group_concat_bug | ALL  | NULL          | NULL | NULL    | NULL |    6 | Using filesort |
+----+-------------+------------------+------+---------------+------+---------+------+------+----------------+
1 row in set (0.00 sec)

mysql> select group_concat(str1) from group_concat_bug group by str2;
+--------------------+
| group_concat(str1) |
+--------------------+
| a,a,a              |
+--------------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain select group_concat(str1) from group_concat_bug group by str2 order by str3;
+----+-------------+------------------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table            | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | group_concat_bug | ALL  | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
+----+-------------+------------------+------+---------------+------+---------+------+------+---------------------------------+
1 row in set (0.00 sec)

mysql> select group_concat(str1) from group_concat_bug group by str2 order by str3;
+--------------------+
| group_concat(str1) |
+--------------------+
| a                  |
+--------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show variables like 'group_concat_max_len';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| group_concat_max_len | 5     |
+----------------------+-------+
1 row in set (0.00 sec)

I tried to patch dump_leaf_key() in item_sum.cc to account for the maximum character width in bytes in the string-length comparison:

- if (result->length() > item->max_length)
+ if ((result->length() * item->collation.collation->mbmaxlen) > item->max_length)

This didn't work out as I planned: repeat runs of the same query return different results.

mysql> set group_concat_max_len=5; select group_concat(str1 separator '') from group_concat_bug group by str2;
Query OK, 0 rows affected (0.00 sec)

+---------------------------------+
| group_concat(str1 separator '') |
+---------------------------------+
| aa                              |
+---------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> set group_concat_max_len=5; select group_concat(str1 separator '') from group_concat_bug group by str2;
Query OK, 0 rows affected (0.00 sec)

+---------------------------------+
| group_concat(str1 separator '') |
+---------------------------------+
| aau                             |
+---------------------------------+
1 row in set, 1 warning (0.00 sec)
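I won't claim this explains the stray byte above, but it does hint at the general hazard: a limit counted in bytes (or a comparison scaled by mbmaxlen, as in my patch) can land in the middle of a multibyte character. A rough illustration in Python, not the server code:

```python
# Illustration only (not MySQL server code): cutting a multibyte string
# at a byte count, the way a bytes-based group_concat_max_len does, can
# split a character in half.
s = "ααα"                        # three 2-byte UTF-8 characters
raw = s.encode("utf-8")          # 6 bytes total
truncated = raw[:5]              # a 5-byte limit splits the last character
print(len(raw))                                 # 6
print(truncated.decode("utf-8", "replace"))     # mangled tail: the last
                                                # character is destroyed
```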

At this point I stopped investigating, because the application using group_concat was just parsing the result back into an array anyway. The developers agreed it was best to drop the group_concat and issue two queries instead. That is what I recommend as the general fix for group_concat problems: the risk of truncated values or running out of memory is too high to justify saving the cost of a second query.
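The replacement pattern can be sketched like this (hypothetical table and column names; sqlite3 stands in for MySQL so the sketch is self-contained): fetch the plain rows and build the per-group arrays client-side, instead of asking the server to concatenate and then parsing the string back apart.

```python
# Sketch of grouping client-side instead of using GROUP_CONCAT
# (hypothetical schema; sqlite3 stands in for MySQL here).
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("g1", "a"), ("g1", "b"), ("g2", "c")])

# One plain query; no group_concat_max_len truncation to worry about.
groups = defaultdict(list)
for grp, val in conn.execute("SELECT grp, val FROM t ORDER BY grp, val"):
    groups[grp].append(val)

print(dict(groups))  # {'g1': ['a', 'b'], 'g2': ['c']}
```

The rows stream through the client one at a time, so nothing is ever silently cut off; the trade-off is an extra round trip and more rows on the wire.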

3 Comments

  1. Hi Eric,

    although I agree there is a lot not to like about GROUP_CONCAT, I find it a very useful tool if used right. There's one thing I found a bit puzzling in your post:

    “the application that was using group_concat was just parsing the result back into an array anyway. … This is what I recommend for fixing group_concat. The risk of having truncated values or running out of memory is too much to justify saving the cost of issuing a second query.”

    Wouldn’t this just mean you’d run out of memory when you try to create the array? I mean, whether the memory is in a string or in an array, the bytes have to be in memory somewhere, no?

    (BTW, I absolutely agree the truncation is a bad thing, that should be an error. And I think GROUP_CONCAT should get a LIMIT clause, not to prevent truncation but there are cases where one could use that to make queries more efficient and less memory intensive)

  2. Shlomi Noach says:

    Hi,
    Roland is the true master of GROUP_CONCAT, but he understates its importance in his comment. GROUP_CONCAT grants you power that cannot otherwise be achieved within a query. It cannot be mimicked by other syntax or commands.
    It is, in essence, a poor attempt to provide Window Functions. It does not do that well, but it revolves around the same kind of power you get by window functions.
    I have written and encountered numerous queries where GROUP_CONCAT saved the day; made for great speed improvements, or made possible what was considered impossible.
    I don’t want to sound superlative, sorry if I do.

    Sometimes it’s not *two* queries one would have to issue, but rather 10,000 queries. So it is definitely not a rule of thumb, in my opinion, to prefer non-GROUP_CONCAT queries.

    Perhaps not in the specific case you have presented, but generally speaking, and assuming people set their group_concat_max_len param to a reasonable value (and the default is terribly low), I do recommend getting the hang of it.

  3. Eric Bergen says:

    I prefer non-group_concat queries because the vast majority of the time I see it used in some bad attempt to mash together two queries that don't belong as one. That, along with the issue I pointed out above and the surprise truncation, is enough to keep it on my list of features that cause more harm than good. I'm not calling to get rid of it; I'm simply pointing out another gotcha that I didn't know about. I think implementing Roland's LIMIT idea would go a long way towards fixing it.
