Friday, February 20, 2015

Type Inference in SQL

What drives me nuts about SQL is the type system, or rather, the lack of well specified type inference. In principle, SQL is a strongly typed language, which means that every expression has a well defined data type that is known at compile time. In practice however, while the underlying database system might know the type, the human as no idea.

To illustrate that, consider the following very simple C++ fragment

auto a = b / c;

If we look at the C++ standard, Section 5.6, we see that for division the "usual arithmetic conversions" are performed (and that phrase has a well-defined meaning, Section 5 describes precisely how types are promoted), and the result of the division has the same data type as the converted input. Which is reasonable.

If we compare that with the following SQL fragment

select b/c as a;

and we want to know the data type of a, we have no idea. The SQL standard, Section 6.26, says

 The precision and scale of the result of division are implementation-defined.

Very helpful. In practice this means everybody does it different, and when building a new system like HyPer, it is unclear what the correct behavior should be. To avoid language fragmentation we usually try to do what everybody else does, in particular PostgreSQL, but in this particular example PostgreSQL is not even statically typed in the SQL sense, as they maintain scale and precision at runtime.

And this has implications for a lot of real-world queries. Consider this SQL fragment:

create table foo(a integer);
select avg(a) as b, avg(a)/2 as c;

What would you expect the types of b and c to be? As usual the standard is completely useless here (both implementation defined). But what should the type be? integer? numeric? If numeric, with which precision and scale? Database systems do it all differently, there is no census at all. And note that the data type affects further behavior, for example many systems truncate when dividing integers, but round when dividing numerics.

A quick grep of the SQL standard indicates 411 occurrences of implementation-defined behavior. And not in some obscure corner cases, this includes basic language features. For a programming language that would be ridiculous. But for some reason people accept the fact that SQL is incredibly under-specified, and that it is impossible to write even relatively simple analytical queries in a way that is portable across database systems.

Which I find very sad. Before adding more obscure features to SQL standard, the language committee should perhaps first invest some effort in standardizing type inference to allow for truly portable SQL queries. Vendors will not like this, of course, because everybody has already implemented it differently. But the longer we wait, and the more implementation-defined behavior we add to the standard, the more the systems diverge. Which is exactly what a common standard is supposed to avoid.