Wednesday, August 24, 2016

Equivalence of Unicode strings is strange

Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intrinsics of Unicode. And that is actually a quite sane model for most applications. Usually we do not care about the precise string structure anyway, and in the few places that we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 correctly.

That was all good until users started complaining about sort orders. When sorting UTF-8 naively, we sort by code point value, which is often ok but not always what users want. So we added support for collations:
select * from foo order by x collate "de_DE";

And that is where life starts getting interesting. The collate statement can either be given explicitly at any place in the query, as shown above, or added in a create table statement, giving a default collation for a column. So basically every value that is ordered or compared within a SQL statement can have an associated collation.

Initially I naively though that this would just affect the comparison operation, like calling strcoll instead of strcmp, but life is unfortunately much more complex. First, the Unicode collation algorithm is quite involved, translating the input sequence into a sequence of 3 (or 4) level of weights before comparisons, and second, some insane database systems default to case-insensitive collations, and some users actually want that behavior.

Why is case-insensitivity insane? Because it leads to strange semantics. First of all, the whole Unicode weighting mechanism is strange. Some of the strangeness can be hidden by using ICU, but you still get strange results. Consider for example the following two strings (\u is a Unicode escape):

abc
abc\u007F

Clearly they are different, right? Well, if you ask the ICU Collation Demo they are not, they are considered equal. The reason for that is the entry in the DUCET table, which for 007F says

007F  ; [.0000.0000.0000] # DELETE (in ISO 6429)
 
So this character must be ignored for comparisons. And there are other code points that are relevant for plain comparisons, but not for accent-insensitive collations, etc. A lot of fun, in particular if you want to implement hash-based algorithms.

But technical problems aside, is that really what users want? Do users expect these two strings to be equal, just as they apparently expect 'abc' and 'äbc' to compare to equal in accent-insensitive collation? And what about queries? Consider for example

select x,count(*)
from (values('abc'),('ABC'),('äbc')) s(x)
group by x collate "de_DE_ci_ai";

which perform the group by in a case-insensitive, accent-insensitive manner, which means that all three strings compare as equal. What is the result of that? abc 3? ABC 3? äbc 3?
All three would be valid answers, as the three are "equal" according to the chosen collation. And the result might even be non-deterministic, if the query is parallelized across threads and the first value wins.

Do users really want that? Well, apparently they do, at least I was told that, and some systems even default to case-insensitive behavior. But I find that strange, the semantic of these queries can be quite puzzling.

No comments:

Post a Comment