In a table x
, there is a column with the values u
and ü
.
SELECT * FROM x WHERE column='u'
.
This returns u
AND ü
, although I am only looking for the u
.
The table's collation is utf8mb4_unicode_ci
. Wherever I read about similar problems, everyone suggests to use this collation because they say that utf8mb4
really covers ALL CHARACTERS. With this collation, all character set and collation problems should be solved.
I can insert ü
, è
, é
, à
, Chinese characters
, etc. When I make a SELECT *
, they are also retrieved and displayed correctly.
The problem only occurs when I COMPARE two strings as in above example (SELECT WHERE
) or when I use a UNIQUE INDEX
on the column. When I use the UNIQUE INDEX
, a "ü"
is not inserted when I have a "u"
in the column already. So, when SQL compares u
and ü
in order to decide whether the ü is unique, it thinks it is the same as the u
and doesn't insert the ü
.
I changed everything to utf8mb4
because I don't want to worry about character sets and collation anymore. However, it seems that utf8mb4
isn't the solution either when it comes to COMPARING strings.
I also tried this:
SELECT * FROM x WHERE _utf8mb4 'ü' COLLATE utf8mb4_unicode_ci = column
.
This code is executable (looks pretty sophisticated). However, it also returns ü
AND u
.
I have talked to some people in India and here in China about this issue. We haven't found a solution yet.
If anyone could solve the mystery, it would be really great.
Add_On: After reading all the answers and comments below, here is a code sample which solves the problem:
SELECT * FROM x
WHERE 'ü' COLLATE utf8mb4_bin = column
By adding "COLLATE utf8mb4_bin" to the SELECT query, SQL is invited to put the "binary glasses" (ending _bin) on when it looks at the characters in the column. With the binary glasses on, SQL sees now the binary code in the column. And the binary code is different for every letter and character and emoji which one can think of. So, SQL can now also see the difference between u and ü. Therefore, now it only returns the ü when the SELECT query looks for the ü and doesn't also return the u.
In this way, one can leave everything (database collation, table collation) the same, but only add "COLLATE utf8mb4_bin" to a query when exact differentiation is needed.
(Actually, SQL takes all other glasses off (utf8mb4_german_ci, _general_ci, _unicode_ci etc.) and only does what it does when it is not forced to do anything additional. It simply looks at the binary code and doesn't adjust its search to any special cultural background.)
Thanks everybody for the support, especially to Pred.
See Question&Answers more detail:os