Friday, December 18, 2015

The price of correctness

When implementing a database system we often have two contradicting goals: Performance and correctness. Being as fast as possible is very attractive, of course. But unfortunately this often means ignoring corner cases, and can thus lead to incorrect results in rare cases.

And usually, incorrect results are unacceptable. Or at least, should be unacceptable. It is very tempting for a prototype system to ignore correctness issues and just aim for maximum performance. But once money is associated with the data, wrong results are very problematic.

A classical example of this is the representation of numerical values in databases. It is tempting to just use doubles for everything. Doubles are well supported by hardware, we can use SIMD instructions with them, and they are often the fastest way to represent non-integer values. On the other hand doubles can get into rounding issues very quickly. This can be seen by when performing the computation 0.1+0.2-0.3:

select ceil((0.1+0.2)-0.3)
 -> 0.0

select ceil((0.1::double precision+0.2::double precision)
            -0.3::double precision)
 -> 1.0

When using a correct NUMERIC implementation (e.g., HyPer or PostgreSQL) we get the correct result of 0.1+0.2-0.3=0. When using DOUBLE PRECISION we get a non-zero result, which is here rounded to 1. Unfortunately some systems like SQLite (and a lot of research systems) do not both to implement NUMERIC correctly, and always use doubles, which leads to wrong results even for the first query.

Implementing NUMERIC correctly means implementing fixed-point arithmetic. The number is usually represented as integer value plus corresponding position of the decimal point, and all operations are then mapped to integer operations. For addition and subtraction that is reasonable easy (as long as both arguments have the decimal point at the same position), but division for example is more complex even if we ignore the decimal point:

int64_t div(int64_t a,int64_t b)
   if (!b) // avoid hardware trap due to division by zero
      throw DivBy0();
   if (b==-1) // avoid hardware trap due to int64min/-1
      return sub(0,a);
   return a/b;

Note that the sub function in there is non-trivial, as we will see in a moment. Plus the extra code needed for handling the decimal point, plus the code needed to handle rounding. A fixed point division operation easily needs 20-30 instructions, compared to a single instruction for a floating point division. This costs performance, of course, but has the undeniable advantage of producing the correct result.

Now why is the subtraction function non-trivial? Because it has to cope with underflows/overflows. And that is not only nice to have, but fundamental, because there are values that are fundamentally non-representably in our usual two's complement representation. Consider for example the following computation:

select (-9223372036854775808)/(-1)
-> div(-9223372036854775808,-1)
-> sub(0,-9223372036854775808)
-> ???

The problem here is that -9223372036854775808 has no corresponding positive number when using 64bit integers for our fixed-point values, the number is non-representable. In fact if we had executed the division without the check for -1 we would have gotten a very unpleasant hardware trap due to that. We avoid the trap by delegating to subtraction, but if we not check for underflows/overflows there we silently produce wrong results.

Checking for overflows manually is quite painful, in particular since signed overflows are undefined in C! We have to break the computation down into unsigned operations, which is both complex and slow. Fortunately recent versions of both gcc and clang added intrinsics to use the CPU flags for overflow checking, which is both much easier and much cheaper:

int64_t sub(int64_t a,int64_t b)
   int64_t c;
   if (__builtin_ssubll_overflow(a,b,&c))
      throw Overflow();
   return c;

Even when using the intrinsics the (correct) fixed-point operations are much more complex than the (inaccurate) double operations. What does that mean for performance? I show the execution time of 100,000 passes over two vectors of 1,000 values each below (i.e., 100,000*1,000 = 100,000,000 arithmetic operation of the indicated type were executed), both for double arithmetic and for fixed point arithmetic. (i7-5820K, gcc 5.2.1, -O3 -march=native)

add sub mul div
double (unchecked) 11ms 11ms 12ms 212ms
fixed point (unchecked) 10ms 10ms 42ms 817ms
fixed point (checked) 57ms 57ms 56ms 912ms

The correct arithmetic (fixed-point, with checks) is quite a bit slower than just using doubles. But that is the price we have to pay for correctness. (Update: in the original posting I forgot -march=native, adding that improved performance of the unchecked versions by another factor 2 due to AVX instructions).
Note that it would have been possible to check for overflows after double operations, too, using fetestexcept. But that is even slower than the checked fixed point arithmetic (>620ms for all cases), and it does not help with the rouding issues of floating point numbers.

So performing numerical operations correctly is difficult and expensive. But still, every system should do it! Users get very unhappy when the result is wrong, in particular if the values correspond to money. If you are building a prototype system, do not use floating point numbers, even if it is fast and tempting. Using doubles is nice for micro benchmarks, but inappropriate for real-world usage when money is involved.

Saturday, September 12, 2015

Trying to speed up Binary Search

It is well known that binary search is not particular fast. For point queries hash tables are much faster, ideally accessing in O(1),  And even when we need range queries n-ary search structures like B-Trees are much faster than binary search or binary search trees.

Still, there is a certain charm to binary search. First, it is easy to implement, And second, it needs no extra memory but can directly operate on sorted data. And sometimes the data is sorted anyway, so we get the binary search more or less for free. And for something we get for free O(log n) lookup time is not that bad. The only question is, how can we implement it efficiently.

Of course there is the text-book way of implementing binary search, as explained for example in Wikipedia. The resulting code looks like this:

while (lower!=upper) {
  unsigned middle=lower+((upper-lower)/2);
  unsigned v=data[middle];
  if (v==needle) {
      return middle;
  } else if (v<needle) {
  } else {
return notFound;

Not particular difficult, but how fast it is? We ran experiments where we performed 1,000,000 random lookups for 32bit integers in data sets of various sizes and measured the performance on a Broadwell i7-5500U (all experiments using -O3):

set size103105107109
classic65 ms119 ms362 ms702 ms

 When we look carefully we notice that the runtime grows a bit faster than the asymptotic complexity of O(log n) would suggest. Or, in other words, the constants for searching in the 4GB data set seem to be about a factor 4 higher than when searching in the 4KB data set. That is not surprising, as the 4KB of the smallest set easily fit into the CPU cache, while the 4GB of the largest set are clearly beyond cache size. What is surprising is that it is only a factor of 4!

The performance of binary search is largely affected by two effects: First, there are cache misses, as every lookup accesses O(log n) elements, which are often not in cache for the large data sets. And second, there are branch mispredictions, as the comparison v<needle has about a 50% chance of being true, which leads to inefficient execution. For small data sets the branch misses dominate, while for the large data sets the branch misses are somewhat hidden by the memory latency. Therefore a slowdown of only 4.

Now the question is, can we do better? Usually people try to speed up binary search by reducing branch misses. There is an interesting talk about binary search, less wrong that described how to implement binary search without the hard to predict branch. Somewhat ironic the example code in that talk is wrong, contrary to its title, but the idea is valid. The corrected code looks like this:

while (auto_t half=n/2) {
   auto middle=lower+half;
return ((*lower)==needle)?lower:notFound;

The code is not only quite short, the loop body is compiled by gcc into completely branch-free assembler code by using a conditional move. The resulting performance is shown below:

set size103105107109
branch free23 ms53 ms320 ms853 ms

This is a somewhat surprising result. For small data sets the branch-free version is indeed faster, but for the largest data set the search is significantly slower. And that is after manual tuning of the generated code, out of the box it needed 920 ms for the largest set.

The speed-up for the small data sizes is easy to explain (no branch mis-predictions), but where does the slow-down for the largest data set come from? We get a clue by compiling the same code with clang, which is uses a branch instead of a conditional move inside the loop. With that we get

set size103105107109
compile with clang67 ms115 ms359 ms694 ms

which is virtual identical to the performance of the text-book implementation. Apparently the conditional move is good to avoid branch mispredictions, but cannot hide memory latency as much as the regular implementation.

So while we found a faster implementation for small data sets, the question is still open if we can do better for large sets (which might be common in databases). We tried using ternary search instead of binary search, with the argument that while ternary search accesses more memory locations than binary search, the memory latency can be hidden by performing the accesses in parallel, and also the number of iterations is smaller. There are many ways to implement ternary search, all with different trade-offs. One alternative that issues parallel memory accesses is shown below:

while (uintptr_t step=n/3) {
  auto t1=lower+step,t2=lower+(step<<1);
  unsigned cmp=((*t1)<=needle)+((*t2)<=needle);
  if (cmp==2) {
      n-=t2-lower; lower=t2;
  } else if (cmp) {
      n-=t1-lower; lower=t1;
  } else {

Unfortunately the results were disappointing:

set size103105107109
ternary search81 ms143 ms400 ms781 ms

The extra memory accesses are just not worth it. In worst case we have now two cache misses per iteration (even if they are done in parallel), and the benefit of the increased fanout is just too small.

Of course there are ways to speed up search in large data sets further. For example by storing the data in a cache-optimized B-Tree rather than a sorted vector. But that forces us to abandon our original mission, namely exploiting sortedness that is available in the data anyway. For small and medium sets we have shown improvements over the text-book approach here, if someone has a good idea for large sets I would be happy to hear about it in the comments.

Update: As Marcin pointed out in the comments, it is possible to use a SIMD-version of binary search to perform 4 searches in parallel. I have used the code from section 5.5.4 of the thesis, it is a bit long to show here. And indeed we get good performance (best of 10 runs):

set size103105107109
SIMD version24 ms63 ms478 ms706 ms

Note that for many data sizes the performance is in the middle between the text-book version and the conditional-move version, regardless of which of the two is actually faster. Puzzling. What is also puzzling is that the variance of the runtime seems to be much higher than for the non-SIMD versions. For the largest data set I have usually seen runtimes around 780ms, with peaks up to 950ms. I have no explanation for that.

If, as Marcin also suggested in the comment below, we replace the manual gather code

int cmpval0 = data_shifted[result_vector[i + 0]];
int cmpval1 = data_shifted[result_vector[i + 1]];
int cmpval2 = data_shifted[result_vector[i + 2]];
int cmpval3 = data_shifted[result_vector[i + 3]];
__m128i xm_cmpvalvec = _mm_set_epi32(cmpval3, cmpval2, cmpval1, cmpval0);

with the AVX2 gather instruction

__m128i xm_cmpvalvec = _mm_i32gather_epi32(data_shifted,_mm_load_si128((__m128i*)(result_vector)),4);

we get as runtime performance

set size103105107109
AVX2 gather26 ms68 ms487 ms708 ms

At a first glance the performance of the AVX2-gather is identical to the manual gather. And indeed I had got exactly that result in previous gather experiments, too. However the variance problem seems to have gone away when using AVX2 gather, the runtime is now relative stable across runs.

Thursday, July 2, 2015

"Open-Sourcing" HyPer queries

We are sometimes contacted by people who would like to inspect the behavior of HyPer. HyPer itself is currently not open source, but it is still possible to look at the source code of queries (which is all that matters for query performance).

To do that, we can use an internal debug interface that we use for profiling. Note that this feature is completely undocumented and unsupported and might go away at any point in time, it is not aimed at end users. But it can still be used to inspect queries. As prerequisite we need the clang compiler in the search path, more precisely a binary called clang-3.5. Then, we can start the server daemon with a debug flag like that:

bin/hyperd mydatabase -xcompilestatic

Now every statement that generates code (e.g., queries, or create table statements) writes the code to the local directory and then call clang-3.5 to compile it into a shared library. Usually we do not use clang but JIT directly for performance reasons, but some debugging and profiling tools can handle shared libraries much better than JITed code. Note that the generated code is probably bested viewed after passing it through c++filt, as that will pretty-print the mangled names.

Consider for example an SQL statement like

create table foo(x integer not null)

It results in three file when run with the debug flag, the file dump1.ll with the source code, a file dump1.debug-ll for debug info purposes, and a file with the compiled code. Here the code is not very exciting, it is basically code for memory management (initPartition_foo etc.), multiversioning (retrieveVersionPartition_foo etc.) and the low-level SQL update primitives (insertPartition_foo).

If we run a query like

select count(*) from foo where x<123

another ll file is generated. Here the query consists basically of a single function (planStart).The first few dozen lines are uninteresting, they just navigate memory until the correct columns pointers are computed. The main query code itself is quote short, I have included it below (ignoring the extra code to handle tuples that are modified by concurrent transactions).

  %tid = phi i64 [ %86, %loopDonescan18 ], [ %66, %loopscan.preheader ]
  %76 = call i64 @hyper::RuntimeFunctions::findUnmodifiedRange(hyper::Transaction::VersionRecord**, unsigned int*, unsigned long, unsigned long)(%"hyper::Transaction::VersionRecord"** %72, i32* %74, i64 %tid, i64 %67)
  %77 = icmp ult i64 %tid, %76
  br i1 %77, label %loopscan11.preheader, label %loopDonescan12

  br label %loopscan11, !dbg !156

  %tid13 = phi i64 [ %84, %scanStep ], [ %tid, %loopscan11.preheader ]
  %.sum = sub i64 %tid13, %66
  %78 = getelementptr i32* %68, i64 %.sum
  %x = load i32* %78, align 4
  %79 = icmp slt i32 %x, 123
  br i1 %79, label %then14, label %scanStep

  %80 = load i64* %0, align 8
  %81 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %80, i64 1)
  %82 = extractvalue { i64, i1 } %81, 1
  br i1 %82, label %overflow.loopexit35, label %cont16

  %83 = extractvalue { i64, i1 } %81, 0
  store i64 %83, i64* %0, align 8
  br label %scanStep

  %84 = add i64 %tid13, 1
  %85 = icmp eq i64 %84, %76
  br i1 %85, label %loopDonescan12.loopexit, label %loopscan11

  br label %loopDonescan12

It first figures out which tuples are unmodified and can be scanned safely. Then (starting from loopscan11) it scans over these tuples, loads the x column, checks the filter condition (before then14), and updates the count for qualifying tuples (including overflow checks).

Note that the query is not parallelized because the query optimized decided that the (empty) table did not warrant parallelization. For more larger data sets the generated code will be more complex due to parallel execution, but it should be quite readable in general.

Note that there is a particular hackish way to exploit that interface that might be interesting for low-level experiments: Instead of providing the real clang compiler, we can put a script called clang-3.5 into your search path that first opens the provided source code with an editor and then calls clang. This allows is to modify the generated code and play with the query as we like. Do do not do that if you value the data that is currently stored inside your database! Bad things can and will happen. But it is great for ad-hoc experiments.

Wednesday, June 10, 2015

First impressions of MemSQL

The MemSQL system uses some interesting techniques like query compilation and lock-free data structures, and aspires to be "The Fastest In-Memory database". Parts of its approach are similar to our HyPer system, therefore we were curious and downloaded the MemSQL Community Edition for testing.

The MemSQL web page is not very explicit about it, but after some testing we tend to believe that MemSQL aims at OLTP workloads, so our tests might be unfavorable for MemSQL, as we tried out TPC-H SF1, an OLAP workload. We used the Comunity Edition 4.0.28 for that, and ran MemSQL out of the box. Which worked very pleasantly by the way, installation was completely hassle free.

What was surprising was that while some queries were fast, some queries were very slow. Below are the runtimes for the first 5 TPC-H queries, running on a six core i7-3930K, compared with HyPer v0.5-332.


After investigating the plans with explain we saw that apparently MemSQL always uses either index-nested-loop-joins (INL) or nested-loop-joins (NL), which is very expensive in large, TPC-H-style join queries. The INL is ok, although still somewhat expensive, as seen in Q4, where only INL is used, but if the system is forced to use NL, performance is very poor.

Therefore our assumption that the system aims at OLTP settings, where large joins do not occur. Unfortunately there is no easy way to run TPC-C on MemSQL (we would have to implement a benchmark driver first), therefore we stoped here with just some first impressions. Perhaps we can get a student to implement an OTLP workload for both systems, we will publish an update then.

An interesting side aspect is the compilation model. HyPer generates machine code using the LLVM compiler backend, while MemSQL generates C++ code and passes it through GCC. That is particular interesting because we can look at the generated C++ code and see what the system is doing (and the generated code is quite readable). A downside of generating C++ code is compile time: While HyPer uses in the order of 100ms to compile a TPC-H query, MemSQL uses in the order of 10s, a factor 100 slower. They mitigate that using a plan cache, but interactive query sessions can become quite painful.

But for the OLTP side the system looks quite interesting, they use fancy data structures like lock-free skip lists, and it will be interesting to see how they perform there.

Update: Robbie Walzer from MemSQL suggested to add foreign key indexes, which eliminates the expensive nested loop joins from the queries. I have included the numbers below, and I have also added results from PostgreSQL 9.4. Note that PostgreSQL is disk-based and single threaded, so the comparison is unfair, but it represents "classical" systems.

QueryMemSQLwith FKPostgreSQLHyPer
Q10.64s 0.64s 10.90s 0.013s
Q245.92s 0.11s 0.38s 0.001s
Q3326.94s 0.83s 1.96s 0.013s
Q40.56s 0.57s 0.48s 0.007s
Q5120.50s 0.79s 0.63s 0.008s

And indeed the foreign key indexes greatly improve performance, nested loop joins are just too expensive. Note, however, that the other two systems do fine without these extra indexes, so investing some effort in a hash join implementation might be worthwhile for MemSQL.

Robbie also mentioned cluster support, which we did not test at all here. I assume MemSQL scales nicely with the number of nodes in a cluster, we will test that at some later point. On the other hand even with perfect scalability you need a decent number of nodes until the single-node performance of HyPer is reached in TPC-H. Things might be very different in OLTP settings like TPC-C, of course, there the index structures of MemSQL can probably shine.

Monday, April 20, 2015

Using HyPer with PostgreSQL drivers

using a database from within an application requires language bindings, database drivers, etc., Support  a large range of languages is requires a lot of effort, therefore we have decided to emulate the PostgreSQL wire protocol in HyPer. The immediate benefit is that it is now possible to use HyPer with many existing tools, including psql and the PostgreSQL JDBC driver.

To try it out, download the updated binary, unpack it, create a new database instance using

bin/hyperdb --initdb mydatabase

And start the database server (note that you need to add -i is you plan to use TCP, for example for the JDBC driver)
bin/hyperdb mydatabase

Now we can connect using the regular PostgerSQL tools, for example by using

psql -h /tmp -p 7483

The PostgreSQL wire protocol emulation is still relatively new, but it should be usable already. Please send us a note if you encounter bugs or missing features.

A more detailed description of how to use that setup together with the DBT-3 setup can be found here. We had to fix some bugs in the original DBT-3 release to get it running, but now we can run nearly exactly the same benchmark script in PostgrSQL and HyPer (we are working on fixing the remaining issues).

Of course while we want to emulate the PostgreSQL interface for compatibility reasons, the actual implementation is completely different. We have some performance numbers for SF-10 on PostgreSQL and on HyPer online, and the composite score of HyPer is roughly a factor 45 better than the score of PostgreSQL.

Friday, February 20, 2015

Type Inference in SQL

What drives me nuts about SQL is the type system, or rather, the lack of well specified type inference. In principle, SQL is a strongly typed language, which means that every expression has a well defined data type that is known at compile time. In practice however, while the underlying database system might know the type, the human as no idea.

To illustrate that, consider the following very simple C++ fragment

auto a = b / c;

If we look at the C++ standard, Section 5.6, we see that for division the "usual arithmetic conversions" are performed (and that phrase has a well-defined meaning, Section 5 describes precisely how types are promoted), and the result of the division has the same data type as the converted input. Which is reasonable.

If we compare that with the following SQL fragment

select b/c as a;

and we want to know the data type of a, we have no idea. The SQL standard, Section 6.26, says

 The precision and scale of the result of division are implementation-defined.

Very helpful. In practice this means everybody does it different, and when building a new system like HyPer, it is unclear what the correct behavior should be. To avoid language fragmentation we usually try to do what everybody else does, in particular PostgreSQL, but in this particular example PostgreSQL is not even statically typed in the SQL sense, as they maintain scale and precision at runtime.

And this has implications for a lot of real-world queries. Consider this SQL fragment:

create table foo(a integer);
select avg(a) as b, avg(a)/2 as c;

What would you expect the types of b and c to be? As usual the standard is completely useless here (both implementation defined). But what should the type be? integer? numeric? If numeric, with which precision and scale? Database systems do it all differently, there is no census at all. And note that the data type affects further behavior, for example many systems truncate when dividing integers, but round when dividing numerics.

A quick grep of the SQL standard indicates 411 occurrences of implementation-defined behavior. And not in some obscure corner cases, this includes basic language features. For a programming language that would be ridiculous. But for some reason people accept the fact that SQL is incredibly under-specified, and that it is impossible to write even relatively simple analytical queries in a way that is portable across database systems.

Which I find very sad. Before adding more obscure features to SQL standard, the language committee should perhaps first invest some effort in standardizing type inference to allow for truly portable SQL queries. Vendors will not like this, of course, because everybody has already implemented it differently. But the longer we wait, and the more implementation-defined behavior we add to the standard, the more the systems diverge. Which is exactly what a common standard is supposed to avoid.

Thursday, January 8, 2015

Fun with CHAR

The CHAR(n) data type is one of the more bizarre features of SQL. It is supposed to represent a fixed-length string, but as we will see in a moment, it behaves odd in all kinds of ways. IMHO it should never be used for anything. There might be use cases for CHAR(1), i.e., for storing a single character, but as a string data type its semantics are so odd and broken that it should never be used.

At a first glance, CHAR is simple. CHAR(n) means that the corresponding value is a string with a fixed length of n, padding with space as needed. Thus

select cast('A' as CHAR(5))
'A    '

An easy concept, except that people apparently noticed that such a data types would not be very useful in practice when comparing values like this:

create table foo(x char(5));
create table bar(y char(10));
select * from foo,bar where x=y;

If we would take the "fixed-length string" semantic literally, we would never find join partners, as x and y will always differ due to their different length.

To fix that, the SQL standard mandated that for comparison purposes CHAR values are implicitly padded with <space> until they reach the length of the other string. Which means that

   'A  '   = 'A    ' (char(3) and char(5))
=> 'A    ' = 'A    ' 
=> true

Which "fixed" the comparison problem, but caused all kinds of other problems. Note that the padding behavior is always active, even when comparing with non-CHAR strings like VARCHAR. This results in absurd semantic issues:

create table foo(x varchar(10),y char(5),z varchar(10));
insert into foo values ('A ','A','A  ');
select x=y, y=z, x=z from foo;

Equality is no longer transitive! The CHAR/VARCHAR comparisons do add the padding to the shorter CHAR string, and thus compare as equal. The VARCHAR comparisons use the literal values, and thus compare as not equal. This kills transitivity, ruins all kinds of query optimizer reasoning, and is a pain in general.

This implicit padding behavior also ruins hash-based approaches, as we cannot know beforehand how many spaces we would have to add, which makes computing hash-values impossible. A common remedy for that is to do the exact opposite of what CHAR is supposed to mean, namely always strip all trailing spaces from CHAR values, and then for comparisons also strip trailing spaces from the comparison partner. We always have a unique representation for CHAR values then, and can use hashing. The only problem is that this affects observable behavior, as there are a few characters that should be sorted before space (depending on the collation, of course):

create table foo(x char(3));
insert into foo values (''),(chr(3));
select * from foo order by x;
'   '
'\x03  '

This is the output order that PostgreSQL and HyPer are producing, but strictly speaking the tuples should be in the other order, as chr(3)<chr(32). Fortuntately most users do not care about characters below space, so we can get away with the trimming trick.

After all these problems, what might be reasons that speak in favor of CHAR(n) (with n>1)? To be honest, I have no idea. In the past one argument in favor of CHAR might have been that fixed-length strings are easier to handle than variable length strings. But that argument is largely moot today. Note that, at least by default, the "n" specifies the number of characters, not the number of bytes. When using Unicode this is not very helpful for determining the length of the string, unless one wants to waste a lot of space. Thus CHAR strings are not really fixed length, anyway. They cause a lot of problems, but have virtually no benefit over VARCHAR.

Fun data point that I noticed when comparing HyPer with other database systems: Even well established systems like PostgreSQL get the CHAR semantics wrong. PostgreSQL 9.4 does it correct when comparing with VARCHAR, but forgets to pad when comparing to TEXT (but trims instead, which completely ruins the comparison):

select 'A'::char(2) ='A '::varchar(10) => T
select 'A'::char(2) ='A '::text        => F
select 'A '::char(2)='A '::text        => F

Don't use CHAR(n). Use VARCHAR(n) or TEXT.