Wednesday, December 17, 2014

Type Providers in C++

I recently learned about Type Providers in F#, which (among other things) allow you to import the result schema of database queries into your source code. Basically, the compiler can check at compile time whether your query is correct, and can offer you the result columns directly in your program. Which is neat, because it makes database access much nicer and more concise.

After reading that, I wondered: can we do that in C++? And of course we can, at least with a compiler plugin or a small pre-processor. I hacked up a quick prototype that scans C++ for embedded SQL snippets like this

for (auto t:SQL select a,5*x as b from r,s where r.x=s.y)
   cout << t.a() << " " << t.b() << endl;

extracts the SQL statements, asks the database for the result schema of each query, and replaces the SQL with a C++ object that offers exactly the same interface as the query result. You can take a look at an example program and the translated result.

The current extractor is a bit hack-ish, but I think the concept is very nice: you get very natural access to query results. Both the extractor and the generated code use libpqxx for database access, so in principle it should work for PostgreSQL and HyPer. Unfortunately I could not figure out how to get the result schema of a query in libpqxx without actually executing it when using PostgreSQL as backend. Recent HyPer releases offer explain schema to obtain the result schema of a query, which PostgreSQL does not support. So while the generated code supports both databases, the extractor is HyPer-only. Porting that part to PostgreSQL should be simple, though; patches are welcome.

Monday, September 8, 2014

Experiments Hurt the Review Process

Experimentally validating (or falsifying) a theory is one of the fundamental aspects of science. As such I have always put a lot of emphasis on experiments, and of course empirical evidence is essential when designing systems. To pick a simple example: merge sort has better asymptotic behavior than quick sort, but in practice quick sort is usually faster. Wall-clock time is what matters.

However, while all that is true in general, in practice experiments can be quite harmful, in particular in the context of peer-reviewed papers. In the rest of this post I will try to illustrate why that is the case and what could potentially be done about it.
It took me a while to realize that experiments hurt the review process, but both real papers that I have seen and, in particular, input from my colleague Alfons Kemper have convinced me that experiments in papers are a problem. Alfons suggested simply ignoring the experimental section of papers, which I find a bit extreme, but he has a point.

The first problem is that virtually all papers "lie by omission" in their evaluation section. The authors will include experiments where their approach behaved well, but they will not show results where it has problems. In computer science we consider that perfectly normal behavior; in fields like pharmacology people would go to jail for it.

Furthermore, this behavior interacts badly with the review process. Of course the reviewers know that they are shown only the good cases, so these cases have to be really good. If someone writes a paper about an approach that is nicer and cleaner than a previous one, but is 20% slower in experiments, they will have a hard time publishing it (even though the 20% might well just stem from small implementation differences).
Which is a shame; reviewers are often too keen on good experimental results.

The easiest way to kill a paper is to state that the experiments are not conclusive enough. The authors will have a hard time arguing against that, as there is always another experiment that sounds plausible and important and on which the reviewer can insist. Authors therefore have an incentive to make their experiments slick-looking and impressive, in the hope of guiding the reviewer in other directions.

Which brings me to the core of the problem with experiments in papers: there have been papers where the thin line between careful experimental setup and outright cheating has been touched, perhaps even crossed. And that is really bad for science. Now one can claim that this is all the fault of the authors, and that they should be tarred and feathered, etc., but that is too short-sighted. Of course it is the fault of the authors. But the current system is rigged in a way that makes it very, very attractive to massage experimental results. I once heard the statement that "the reviewers expect good numbers". And unfortunately, that is true.
And this affects not only the highly problematic cases. Even papers that stay "legal" and "just" create good-looking results by a very careful choice of experimental setup are quite harmful: in the best case we learn little, in the worst case we are misled.

Now what can we do about that problem? I don't really know, but I will propose a few alternatives. One extreme would be to largely ignore the experimental part of a paper and evaluate it purely based upon the ideas presented within. Which is not the worst thing to do, and arguably it would be an improvement over the current system, but if we ignore the actual experiments, the quick sort mentioned above might have had a hard time against the older merge sort.

The other extreme would be what the SIGMOD Repeatability Effort tried to achieve, namely that all experiments are validated. Ideally this validation would happen during reviewing (SIGMOD did it only after the fact): the reviewer would repeat the experiments, try out different settings, and fully understand and validate the pros and cons of the proposed approach.
In an ideal world that might actually be the best approach, but unfortunately it is not going to happen. First, authors will claim IP problems and all kinds of other excuses why their approach cannot be validated externally. And second, even more fundamentally, reviewers simply do not have the time to spend days repeating and validating experiments for each paper they review.

So what could a compromise look like? Perhaps a good mode would be to review papers primarily based upon the ideas presented therein, and take only a cursory look at the experiments. The evaluation should look plausible, but that is it; it should not have much impact on the review decision. In particular, authors should not be expected to produce another miracle result: reporting honest numbers is preferable to a new performance record.
If the authors want to, they could optionally submit a repeatability set (including binaries, scripts, documentation, etc.) together with their paper, and that would give them bonus points during reviewing, in particular for performance numbers, as the reviewers could then verify the experiments if they wanted. There is no guarantee that they will, and papers should still be ranked primarily based upon ideas, but that would allow for more trustworthy experimental results.

Experiments are great in principle, but in the competitive review process they have unintended consequences. In the long run, we have to do something about that.

Thursday, August 7, 2014

TPC-DS with Vector Hadoop Edition

My kick-off blog entry for Database Architects contained the first announcement of a product my friends over at the Vectorwise group of Actian in Amsterdam have been working on in the past years: the Hadoop version of what is now called Actian Vector. In this post, I will go behind the scenes of this system, which I think is currently the fastest SQL-on-Hadoop system available.

In my first post, this system was only known under its code name Vortex, though the marketeers of Actian have in the end branded it Actian Vector - Hadoop Edition. In fact, they advertise the product mostly under yet another name: Actian Analytics Platform - Hadoop SQL Edition. This platform contains two main products: besides Vortex you also get Actian DataFlow (formerly Pervasive's "DataRush" product), which one can compare to something like Yahoo Pig on steroids, that is, an interface for building ETL dataflows with full relational operators, equipped with a graphical user interface (KNIME). It shares no code with Pig, and I say "on steroids" because it is both faster and has broader functionality than Pig; e.g., it is not only a relational processing pipeline, but also contains a suite of data mining and data integration algorithms. Actian DataFlow has an optimized parallel path to load data into Vortex. It is a luxury for Vortex to launch with such an easy-to-use and functional Hadoop ETL system.

SQL-on-Hadoop is in the spotlight lately because IT users are (i) looking to standardize their cluster hardware and software management, and Hadoop has become this standard software layer, and (ii) because typical Big Data pipelines, while they may start from many messy textual or log data sources, in the end spit out cleaner and more structured results that can be processed by SQL systems. This also makes it possible to directly leverage the many existing SQL business applications over Hadoop-based data. From what Actian people tell me, there is a lot of interest in Vortex, and some customers already bought the platform right at announcement. Vortex was announced on June 3, 2014 at the Hadoop Summit in San Jose (see the video or flip through the presentation), and it was released by the end of June 2014.

I started to write this blogpost with the simple goal of providing the details of the performance results obtained in a side-by-side comparison with Impala version 1.3.1 on the benchmark that the Cloudera folks have released at github.com/cloudera/impala-tpcds-kit. This is a 19-query subset of the normally 99-query TPC-DS benchmark. TPC-DS has been in the making forever and seems to take ages to replace TPC-H, so this may be seen as a good sign of initial industry adoption (TPC-DS still has no official results). On the downside, though, selecting 19 queries out of 99 may not be a very objective use of the benchmark, and additionally the Impala subset modified the queries by removing all ROLLUP and window functions (not supported in Impala) and by adding extra SQL clauses that help Impala prune partitions. But my understanding of the TPC-DS benchmark is still limited, so I withhold judgement on whether this 19-query subset is representative. It just takes a lot of time to analyze and play with 99 different queries, and I have not yet found that time...

The experiment, at scale factor 3000 (3TB of raw data), was run on a small cluster of 5 nodes, each with 64GB RAM and 16 cores (Xeon, 2.4GHz), with hyperthreading disabled. The nodes each have two 1Gb Ethernet adapters and two 1TB magnetic disks. Both systems are hot (second run), and the Hadoop compute node configuration has not changed since data import (hence we can assume both systems read local HDFS blocks where possible). All results are in seconds; the first line is Vortex, the second line is Impala (with its stock settings as published in the original Impala blogpost, back in January):

             Q03     Q07     Q19     Q27     Q34     Q42     Q43     Q46     Q52     Q53     Q55     Q59     Q63     Q65     Q68     Q73     Q79     Q89     Q98
Vortex      0.98    2.52    2.49    2.71    2.48    0.39    4.15    5.19    0.64    1.13    0.54   30.15    1.10   10.77    2.53    1.20    3.83    2.02    1.00
Impala     18.78   23.85   18.01   20.16   45.30    5.47   62.33   30.53    5.59   22.06    5.33  456.67   21.92  262.67   18.40    9.05   17.72   34.28    7.35

In total, Vortex is 14x faster than Impala on this benchmark. Extrapolating from the Impala follow-up blogpost on this benchmark in May, we could even conclude that Vortex is also much faster than Spark, PrestoDB and the "Stinger" release of Hive.

Let me state that I highly esteem Impala as an attempt to create a fast analytical database system for Hadoop. It has quite a few nice features, such as support for compressed columnar formats like ORCfile and Parquet, and it can compile queries at run-time to machine code using LLVM. However, its support for SQL is still far from complete and below what, in my MonetDB and Vectorwise experience, the analytical data market demands. Which is a... very large amount of features, including not only the SQL-99 analytical extensions (window functions, rollup), but also workload management, access control and a further long list. Hence, Impala still has a long way to go. Actian Vector has been on the market for four years, has been in development for a decade, and its SQL frontend and API have been on the market even longer (they stem from Ingres). Now, with many customers in production, I almost dare to start calling Vector "mature". So, in my experience it takes at least a decade to reach maturity.

The biggest shortfall of Impala in this benchmark is that it runs each query single-threaded: Impala just uses one core per node for query processing. In an IT environment where each low-end server has at least 16 cores, this can mean a 16-fold hit in performance, and the number of cores per server is still rapidly increasing (newer Intel machines can have up to 120 cores), so it will get even worse in the future. I do expect that at some point Impala will become multi-threaded, but as explained below, exploiting this efficiently on benchmarks like TPC-DS is by no means trivial. On the one hand, one needs to keep all cores busy; on the other hand, all these busy cores cause a large increase in memory cache pressure and network bandwidth, which can easily drive the system into all kinds of bottlenecks.

Digging Deeper..
Writing this blogpost just about the overall results of this benchmark left the database architect in me a bit unsatisfied, so I decided to show an example of how the network bottleneck can affect TPC-DS, and what can be done about it. Taking Vectorwise from being the fastest per-core single-server analytical database system on the market into cluster territory required a lot of work on the columnar rewriter (optimizer) of Vectorwise. Now, we will dig deep into the optimization of one benchmark query in particular, namely TPC-DS Q98 - Impala flavor:

select i_item_desc,
       i_category,
       i_class,
       i_current_price,
       sum(ss_ext_sales_price) as itemrevenue
       -- commented out, Impala does not support PARTITION BY:
       -- sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) OVER (PARTITION BY i_class)
from store_sales, item, date_dim
where ss_item_sk = i_item_sk
  and i_category in ('Jewelry', 'Sports', 'Books')
  and ss_sold_date_sk = d_date_sk
  and ss_sold_date_sk between 2451911 and 2451941  -- manually added partition key filter
  and d_date between '2001-01-01' and '2001-01-31'
group by i_item_id, i_item_desc, i_category, i_class, i_current_price
order by i_category, i_class, i_item_id, i_item_desc
limit 1000; -- added limit

To make their 19-query derived subset of TPC-DS run well, the Impala folks modified the queries from the official TPC-DS definitions. In the case of Q98 they added the clause
ss_sold_date_sk between 2451911 and 2451941
in order to trigger partition pruning (plus, the use of the "partition by" SQL-99 clause was removed, and the result was made smaller with a LIMIT 1000). In the original Impala setup - which we followed - the fact table store_sales is partitioned on date, or rather on the foreign key column of the date dimension, ss_sold_date_sk. The above selection predicate is phrased directly on this foreign key, which is the table partitioning key, and this allows Impala to conclude that the great majority of the partitions contain irrelevant data and can be skipped (there are multiple years of data in TPC-DS, and this selection corresponds to just one month). This explicit hack had to be added because Impala's optimizer cannot yet automatically infer this foreign key restriction from the SQL predicate
d_date between '2001-01-01' and '2001-01-31'
which expresses exactly the same restriction.

As an aside, TPC-DS uses a date dimension that has preemptively been filled with decades of dates, even though the fact tables refer to just a small time period of a few years. Hence selecting 5% of the tuples in the date dimension either means selecting 0 tuples from the fact table (because the fact table contains tuples from just a few years, not intersecting with the selected 5%), or, more likely, selecting much more than 5% of the fact table. This may initially be missed by optimizers, which may just use the selection percentage of the dimension as the selection percentage of the fact table. I saw it happen in Vectorwise (since fixed ;-)), and PostgreSQL makes the same mistake. It is relatively straightforward, however, to fix the cost model estimates by looking at the value distribution of the ss_sold_date_sk column in the optimizer statistics, comparing it to the value distribution of the d_date_sk column, and adjusting the plan estimates accordingly.

In the TPC-DS schema design used in the Vortex experiments, store_sales is partitioned on its primary key, not on the date foreign key, but a clustered index is created on the ss_sold_date_sk column. This means that each partition is organized in date order, yet holds a uniform subset of the table. The "MinMax indexes" which Vectorwise automatically creates on all columns, plus some intricate logic in the columnar rewriter of Vectorwise, allow Vortex to skip all irrelevant data straight from the normal TPC-DS schema. No help required.

While the Actian engineers were looking at the TPC-DS benchmark, I got involved as well, and helped provide some ideas for optimization. It was my first confrontation with TPC-DS, and so far I have looked at only 19 out of the 99 queries, so my impression and understanding of its "choke points" is still very partial compared to my understanding of its predecessor TPC-H (see my post on the choke points of TPC-H, including an explanation of what I actually mean by "choke points"). Whereas large joins are an important problem in TPC-H, in the case of TPC-DS on cluster systems it appears that a bigger challenge is handling large-scale aggregations. An aggregation is a GROUP BY query, but if the number of GROUP BY result tuples is large and unrelated to the partitioning keys, then all nodes must help in computing this large aggregation (because it is so large and thus expensive), and need to send data to each other (because it is unrelated to the partitioning). This all-to-all communication pattern is the worst that can happen to a parallel query plan, as it can make the system network-bound, and this is what happened in the GROUP BY of Q98, illustrated below:



The above picture shows the un-optimized original query plan that Vortex used. Single-server Vectorwise normally chooses between various parallel (multi-core) aggregation strategies, where C is the number of cores involved in the query on a single node. Here are two such strategies being considered:
  1. Scan(tuples)=>Aggr(Local)=>eXchangeUnion XU(C:1)=>Aggr(Global)
  2. Scan(tuples)=>eXchangeHashSplit XHS(C:C)=>Aggr(Partitioned)=>eXchangeUnion XU(C:1)
An eXchangeUnion XU(C:1) runs on C input threads and unions their output, sending it to one thread. The eXchangeHashSplit XHS(C:C) runs on C threads and distributes all tuples based on a hash code, sending the output to C threads. In the first strategy, each thread scans a (supposedly different) subset of the tuples, locally aggregates that data, and then sends the aggregate results to one coordinating thread. This thread re-aggregates the results (e.g. a global SUM is a SUM of local SUMs) and returns them. In the second strategy, the tuples are hash-partitioned according to the aggregation keys (XHS): each thread reads some tuples and sends each tuple to one of the other threads (which thread is determined by a "hash" code computed over the GROUP BY key). Thereafter, each thread only has tuples that hash to the same number: the data has been re-partitioned. Subsequently, each thread can locally compute its aggregate in one go, because all data of each group is in the same thread. The final result is the union of all results (XU).

The parallel execution model based on exchange operators originally implemented in single-server Vectorwise was extended in Vortex with distributed exchange operators (DXU, DXHS), which behave exactly the same, but can also exchange data over the network with threads that run on other compute nodes. In the below distributed variants of the two strategies introduced above, N is the number of nodes, and C@N means that C threads are running on each of N nodes, hence C*N threads are running in the total system (this is just convenience notation and does not imply that Vortex always uses the same number C of threads on all N nodes):
  1. =>Aggr(Local)=>DXU(C@N:1)=>Aggr(Global)
  2. =>DXHS(C@N:C@N)=>Aggr(Partitioned)=>DXU(C@N:1)
Given the somewhat large size of the GROUP BY (53K results), the second strategy was chosen in the graphically depicted query plan above (DXHS operators). The cost model of Vortex did not choose the first strategy because each thread's Aggr(Local) would locally aggregate and produce roughly 50K results, after which all 80 threads would exchange 80*50K tuples. This would mean that in the final Aggr(Global) step of the query, a single thread needs to aggregate 80*50K = 4M tuples: significant non-parallel work, therefore little speedup. The second strategy instead exchanges the tuple stream based on a hash code computed over the GROUP BY columns, such that each thread receives data for a distinct subset of groups. It therefore only aggregates once and, more importantly, all cores participate in the Aggr(Partitioned) computation, which makes it superior here. The problem in this second approach, which the trained Vectorwise query performance analyst identifies by the dark red color of the operators in the execution plan, is that these DXHS operations are quite slow. Each compute node is sending and receiving 16*50K tuples, and the tuples are quite wide (45 bytes), so this is close to 40MB of data, and sending that over Ethernet takes time. Essentially, the system is running into the dreaded network bottleneck. We also run Vortex on clusters with Infiniband, where this problem is less severe (but still occurs on some queries). However, many Hadoop clusters just have an Ethernet interconnect.

For this reason, a new strategy was added to the Vortex rewriter; here are all three:
  1. Aggr(Local)=>DXU(C@N:1)=>Aggr(Global)
  2. DXHS(C@N:C@N)=>Aggr(Partitioned)=>DXU(C@N:1)
  3. XHS(C@N:C@local)=>Aggr(Local)=>DXHS(C@N:C@N)=>Aggr(Partitioned)=>DXU(C@N:1)
Here C@local means that the local XHS (not DXHS) redistributes the data only inside the same node. In our benchmark, it just exchanges data between the 16 local threads, for which no network communication is required. Each thread then pre-aggregates the data locally, producing independent, non-overlapping results among the 16 threads. The effect of this pre-aggregation is that the number of tuples is reduced, in this case by a factor of 10. The subsequent DXHS operation thus has to send 10x less data and avoids the network bottleneck. All in all, Q98 became twice as fast with this optimization.


Note that an alternative variant of the third strategy that omits the XHS(C@N:C@local) step and just aggregates locally before sending to the network would be much less effective. Thanks to the local exchange, each thread gathers the keys it is responsible for from its 16 local neighbors, and finds on average 100 duplicates per group. Without the local exchange, a thread would on average find just 100/16 ≈ 6 duplicates, hence the subsequent DXHS network communication volume would be reduced much less.

This has been just one example of the many cost-based rewrite rules that were introduced in Vortex. As David DeWitt rightly notes, query optimization is more complicated than rocket science and can thus evolve infinitely, but given the results Actian has seen so far, I think that Actian Vector - Hadoop Edition (Vortex) is already an interesting addition to the sprawling SQL-on-Hadoop market. With my students, we are discovering many scientific challenges as well, and we look forward to reporting on our results in addressing some of these in the near future.

Tuesday, July 29, 2014

The LDBC Social Network Benchmark


The Linked Data Benchmark Council is a new benchmarking organization for RDF and graph data management technology (from neo4j to Giraph to Owlim), and the Social Network Benchmark (SNB) is one of its first initiatives. The SNB was created in the LDBC EU project, in which both Thomas and I are active, and was already used by this year's ACM SIGMOD Programming Contest, which was about graph analytics.
SNB is intended to provide the following value to different stakeholders:
  • For end users facing graph processing tasks, SNB provides a recognizable scenario against which it is possible to compare merits of different products and technologies.  By covering a wide variety of scales and price points, SNB can serve as an aid to technology selection.
  • For vendors of graph database technology, SNB provides a checklist of features and performance characteristics that helps in product positioning and can serve to guide new development.
  • For researchers, both industrial and academic, the SNB dataset and workload provide interesting challenges in multiple technical areas, such as query optimization, (distributed) graph analysis, transactional throughput, and provides a way to objectively compare the effectiveness and efficiency of new and existing technology in these areas.
We should clarify that even though the data model of SNB resembles Facebook (and we're extending it to also look more like Twitter), the goal of SNB is not to advise Facebook or Twitter what systems to use; they don't need LDBC for that. Rather, we take social network data as a model for the much broader graph data management problems that IT practitioners face. The particular characteristic of a graph data management problem is that the queries and analysis are not just about finding data by value, but about learning about the connection patterns between data. The scenario of the SNB, a social network, was chosen with the following goals in mind:
  • the benchmark scenario should be understandable to a large audience, and this audience should also understand the relevance of managing such data.
  • the scenario in the benchmark should cover the complete range of challenges relevant for graph data management, according to the benchmark scope.
  • the query challenges in it should be realistic in the sense that, though synthetic, similar data and workloads are encountered in practice.
The SNB is in fact three distinct benchmarks with a common dataset, since there are three different workloads. Each workload produces a single metric for performance at a given scale, and a price/performance metric at that scale. The full disclosure further breaks down the composition of the metric into its constituent parts, e.g. single-query execution times.
  • Interactive Workload. The Interactive SNB workload is the first one we are releasing. It is defined in plain text, yet we have example implementations in neo4j's Cypher, SPARQL and SQL. The interactive workload tests a system's throughput with relatively simple queries and concurrent updates. The system under test (SUT) is expected to run in a steady state, providing durable storage with smooth response times. Inserts are typically small, affecting a few nodes at a time, e.g. the upload of a post and its tags. Transactions may require serializability, e.g. verifying that something does not exist before committing the transaction. Reads do not typically require more than read-committed isolation. One could call the Interactive Workload an OLTP workload, but while queries typically touch a small fraction of the database, this can still be up to hundreds of thousands of values (often the two-step neighborhood of a person in the social graph). Note that in order to support the read queries there is a lot of liberty to create indexing structures or materialized views; however, such structures need to be maintained under the continuous inserts that are also part of the workload. This workload is now in draft stage, which means that the data generator and driver software stack is ready and the purpose is to obtain user feedback, as well as to develop good system implementations. The first implementations of this workload are now running on Openlink Virtuoso, Neo4j and Sparsity Sparksee, and we are eager to see people try, optimize, and improve these.
  • Business Intelligence Workload. There is a first stab at this workload, formulated in SPARQL and tested against Openlink Virtuoso. The BI workload consists of complex structured queries for analyzing the online behavior of users for marketing purposes. The workload stresses query execution and optimization. Queries typically touch a large fraction of the data and do not require repeatable read. The queries will run concurrently with a trickle load (not out yet). Unlike the interactive workload, the queries touch more data as the database grows.
  • Graph Analytics Workload. This workload is not yet available. It will test the functionality and scalability of the SUT for graph analytics that typically cannot be expressed in a query language. As such it is the natural domain for graph programming frameworks like Giraph. The workload is still under development, but will consist of algorithms like PageRank, Clustering and Breadth First Search. The analytics is done on most of the data in the graph as a single operation.  The analysis itself produces large intermediate results.  The analysis is not expected to be transactional or to have isolation from possible concurrent updates.
All the SNB scenarios share a common scalable synthetic data set, generated by a state-of-the-art data generator. We strongly believe in a single dataset that makes sense for all workloads; that is, the interactive and BI workloads will traverse data that has sensible PageRank outcomes, graph clustering structure, etc. This is in contrast to LinkBench, released by the team at Facebook that manages the OLTP workload on the Facebook graph, which closely mimics the low-level MySQL query patterns Facebook sees, but whose graph structure does not attempt to be realistic beyond the average out-degree of the nodes (so it makes no attempt to create realistic community patterns or correlations). The authors of LinkBench may be right that the graph structure does not make a difference for the simple insert/update/delete/lookup actions which LinkBench itself tests, but for the SNB queries in the Interactive and BI workloads this is not true. Note that Facebook's IT infrastructure does not store all user data in MySQL and its modified memcached ("TAO"); some of it ends up in separate subsystems (using HDFS and HBase), which are outside the scope of LinkBench. For queries like those in the SNB Interactive and BI workloads, however, it does matter how people are connected, and how the attribute values of connected people correlate. In fact, the SNB data generator is unique in that it generates a huge graph with correlations, where people who live together, have the same interests or work for the same company have a greater chance to be connected, and people from Germany have mostly German names, etc. Correlations frequently occur in practice and can strongly influence the quality of query optimization and execution; therefore LDBC wants to test their effects on graph data management systems (the impact of correlation among values and structure on query optimization and execution is a "choke point" for graph data management systems where LDBC wants to stimulate innovation).


As mentioned, we hope SNB will be used on a broad variety of systems, from graph databases (neo4j, Sparksee) to graph programming frameworks (Giraph, GraphLab), RDF databases (Virtuoso, OWLIM), and even relational as well as NoSQL systems. The workloads are quite different, and not all combinations of systems and workloads are even likely. The below table of SNB workloads versus systems shows our current thinking:



Please take a look at ldbcouncil.org/developer/snb to find all relevant technical information on SNB. We are eager to hear your comments.

Monday, July 14, 2014

Benchmark Design Thoughts

In the context of LDBC, an EU research project whose aim is to create the Linked Data Benchmark Council (check its new website!), a benchmarking organization for graph data management systems, Thomas and I are venturing into designing new database benchmarks, rather than our normal practice of using benchmarks for evaluating the performance of the database systems we design and implement. A kind of role reversal, thus.

So, what makes a good benchmark? Many talented people have paved our way in addressing this question, and for relational database systems specifically the benchmarks produced by the TPC have been very helpful in maturing the technology and making it successful. Good benchmarks are relevant and representative (addressing important challenges encountered in practice), understandable, economical (implementable on simple hardware), fair (such as not to favor a particular product or approach), scalable, accepted by the community, and public (e.g. all of their software is available in open source). This list stems from Jim Gray's Benchmark Handbook. In this blog post, I will share some thoughts on each of these aspects of good benchmark design.

A very important aspect of benchmark development is making sure that the community accepts a certain benchmark and starts using it. A benchmark without published results, and therefore without the opportunity to compare results, remains irrelevant. A European FP7 project is a good place to start gathering a critical mass of support (and consensus, in the process) for a new benchmark, beginning with the core group of benchmark designers in the joint work performed by the consortium. Since in LDBC multiple commercial graph and RDF vendors are at the table (Neo Technologies, Openlink, Ontotext and Sparsity), a minimal consensus on fairness had to be established immediately. The Linked Data Benchmark Council itself is a noncommercial, neutral entity which releases all its benchmark specifications and software, as well as many materials created during the design, to the public in open source (GPL3). LDBC has spent a lot of time engaging interested parties (mainly through its Technical User Community gatherings) as well as lining up additional organizations as members of the Linked Data Benchmark Council. There is, in other words, a strong non-technical, human factor in getting benchmarks accepted.

The need for understandability for me means that a database benchmark should consist of a limited number of queries and result metrics. Hence I find TPC-H with its 22 queries more understandable than TPC-DS with its 99, because after (quite some) study and experience it is possible to understand the underlying challenges of all queries in TPC-H. It may also be possible for TPC-DS, but the amount of effort is just much larger. Understandable also means for me that a particular query should behave similarly regardless of the query parameters. Often a particular query needs to be executed many times, and in order not to play into the hands of simple query caching, and also to enlarge the access footprint of the workload, different query parameters should be used. However, parameters can strongly change the nature of a query, which is not desirable for the understandability of the workload. For instance, we know that TPC-H Q01 tests raw computation power, as its selection predicate eliminates almost nothing from the main fact table (LINEITEM) that it scans and aggregates into a small 4-tuple result. Using a selection parameter that would select only 0.1% of the data instead would seriously change the nature of Q01, e.g. making it amenable to indexing. This stability of parameter bindings is an interesting challenge for the Social Network Benchmark (SNB) of LDBC, whose data is not as uniform and uncorrelated as TPC-H. Addressing the challenge of obtaining parameter bindings that have similar execution characteristics will be the topic of a future blog post.
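One straightforward way to obtain stable bindings is to filter candidate parameter values by their selectivity on the actual data, keeping only values close to a chosen target. The following Python sketch is hypothetical (the column, the 20% tolerance, and the selection rule are invented here, and this is not necessarily the method SNB uses), but it conveys the idea:

```python
# Toy skewed column: value v appears roughly 1000/v times (Zipf-like).
column = [v for v in range(1, 101) for _ in range(1000 // v)]

def selectivity(value):
    """Fraction of rows a point predicate on `value` would select."""
    return sum(1 for x in column if x == value) / len(column)

# Binding-selection rule: keep only parameter values whose selectivity is
# within 20% of the target, so every execution of the templated query
# touches a comparable amount of data.
target = selectivity(10)
stable_bindings = [v for v in range(1, 101)
                   if abs(selectivity(v) - target) <= 0.2 * target]

print(stable_bindings)
```

On this skewed column, only the values whose frequency is near that of the reference value survive; very frequent values (which would turn a point lookup into a near-scan) and very rare ones are rejected.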

The economical aspect of benchmarking means that while rewarding high-end benchmark runs with higher scores, it is valuable if a meaningful run can also be done with small hardware. For this reason, it is good practice to use a performance-per-EURO (or $) metric, so that small installations, despite a lower absolute score, can still do well on that metric. The economical aspect is right now hurting the (still) leading relational OLTP benchmark TPC-C. Its implementation rules are such that higher reported rates of throughput require a higher number of warehouses (i.e. a larger data size). In the current day and age of JIT-compiled machine-code SQL procedures and CPU-cache-optimized main-memory databases, the OLTP throughput numbers now obtainable on modern transactional systems like HyPer on even a single server (it reaches more than 100,000 transactions per second) are so high that they lead to petabyte storage requirements. Not only does this make TPC-C very expensive to run, just by the sheer amount of hardware needed according to the rules, but it also undermines its representativity, since OLTP data sizes in practice are much smaller than OLAP data sizes and do not run into the petabytes.

Representative benchmarks can be designed by studying or even directly using real workload information, e.g. query logs. A rigorous example of this is the DBpedia benchmark, whose workload is based on the query logs of dbpedia.org. However, this SPARQL endpoint is a single public Virtuoso instance that has been configured to interrupt all long-running queries, so as to ensure the service remains responsive to as many users as possible. As a result, it is only practical to run small lookup queries on this database service, so the query log contained solely such light queries. As a consequence, the DBpedia benchmark only tests small SPARQL queries that stress simple B-tree lookups (and not joins, aggregations, path expressions or inference) and poses almost no technical challenges for either query optimization or execution. The lesson, thus, is to balance representativity with relevance (see below).

The fact that a benchmark can be scaled in size favors the use of synthetic data (i.e. created by a data generator), because data generators can produce any desired quantity of data. I hereby note that in this day and age, data generators should be parallel: single-threaded, single-machine data generation becomes unbearable even at terabyte scales. A criticism of synthetic data is that it may not be representative of real data, which e.g. tends to contain highly correlated data with skewed distributions. This may be addressed to a certain extent by injecting specific skew and correlations into synthetic data as well (but: which skew and which correlations?). An alternative is to use real data and somehow blow up or contract the data. This is the approach in the mentioned DBpedia benchmark, though such scaling will distort the original distributions and correlations. Scaling a benchmark is very useful to investigate the effect of data size on the metric, on individual queries, or even in micro-benchmark tests that are not part of the official query set. Typically, OLTP database benchmarks have queries whose complexity is O(log(N)) in the data size N, whereas OLAP benchmarks have queries which are linear, O(N), or at most O(N.log(N)) -- otherwise executing the benchmark on large instances is infeasible. OLTP queries thus typically touch little data, on the order of log(N) tuples. In order not to measure fully cold query performance, OLTP benchmarks for that reason need a warmup phase with O(N/log(N)) queries in order to get the system into a representative state.
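On the parallel-generation point, the essential trick is to derive each partition's random seed from the partition number rather than from a shared RNG, so the generated data set is bit-identical regardless of the degree of parallelism or worker scheduling. A minimal Python sketch (illustrative only; the real SNB generator is far more elaborate):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_partition(part):
    # Partition-local RNG seeded from the partition id, not wall-clock time
    # or a global generator, so the output is fully reproducible.
    rng = random.Random(part)
    return [(part * 1000 + i, rng.randint(0, 10**6)) for i in range(1000)]

def generate(parts, workers):
    with ThreadPoolExecutor(workers) as ex:
        chunks = list(ex.map(generate_partition, range(parts)))
    return [row for chunk in chunks for row in chunk]

a = generate(8, workers=2)
b = generate(8, workers=8)
assert a == b   # identical data, independent of parallelism
print(len(a))
```

The same scheme works across machines: each node only needs to know which partition numbers it owns.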

Now, what makes a benchmark relevant? In LDBC we think that benchmarks should be designed such that crucial areas of functionality are highlighted, and in turn system architects are stimulated to innovate: either to catch up with competitors and bring performance and functionality in line with the state of the art, or to address technical challenges for which until now no good solutions exist, but which can give a decisive performance advantage in the benchmark. Inversely stated, benchmark design can thus be a powerful tool to influence the industry, as a benchmark design may set the agendas of multiple commercial design teams and database architects around the globe. To structure this design process, LDBC introduces the notion of "choke points": problems that challenge current technology. These choke points are collected and described early in the LDBC design process, and the workloads developed later are scored in terms of their coverage of relevant choke points. In the case of graph data querying, one of the choke points that is unique to the area is recursive Top-N query handling (e.g. shortest path queries). Another choke point that arises is the impact of correlations between the attribute values of graph nodes (e.g. both employed by TUM) and the connectivity degree between nodes (the probability to be friends). The pattern observed in practice is that people who are direct colleagues are often in each other's friend networks. A query that selects people in a social graph who work for the same company, and then does a friendship traversal, may get a bad intermediate result size estimate, and therefore a suboptimal query plan, if the optimizer remains unaware of value/structure correlations. So this is an area of functionality that the Social Network Benchmark (SNB) by LDBC will test.
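How badly an independence-assuming optimizer misestimates such a correlated conjunction is easy to quantify with a toy example. In this Python sketch the population, the 10% employment rate, and the 50%-vs-2% friendship rates are all invented for illustration:

```python
import random

random.seed(7)

N = 10_000
rows = []
for _ in range(N):
    tum = random.random() < 0.10                       # works at TUM
    # Correlated: TUM employees are far more likely to know this person.
    knows = random.random() < (0.50 if tum else 0.02)  # knows person X
    rows.append((tum, knows))

sel_tum   = sum(t for t, _ in rows) / N
sel_knows = sum(k for _, k in rows) / N
actual    = sum(t and k for t, k in rows) / N

# A naive optimizer multiplies the individual selectivities.
independence_estimate = sel_tum * sel_knows
print(f"estimate {independence_estimate:.4f} vs actual {actual:.4f}")
```

The independence estimate comes out several times too low, and an optimizer building on it will likely pick the wrong join order or traversal strategy.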

To illustrate what choke points are in more depth, we wrote a paper at the TPCTC 2013 conference that performs a post-mortem analysis of TPC-H and identifies 28 such choke points. The below table lists them all, grouped into six Choke Point (CP) areas (CP1 Aggregation, CP2 Join, CP3 Locality, CP4 Calculations, CP5 Subqueries and CP6 Parallelism). The classification also shows CP coverage over each of the 22 TPC-H queries (black is high impact, white is no impact):

I would recommend reading this paper to anyone who is interested in improving the TPC-H score of a relational database system, since it contains the collected experience of three database architects who have worked with TPC-H at length: Orri Erling (of Virtuoso), Thomas Neumann (HyPer, RDF-3X), and me (MonetDB, Vectorwise). Recently Orri Erling showed that this paper is not complete, as he discovered one more choke-point area for TPC-H: Top-N pushdown. In a detailed blog entry, Orri shows how this technique can trivialize Q18, and this optimization can single-handedly improve the overall TPC-H score by 10-15%. This is also a lesson for LDBC: even though we design benchmarks with choke points in mind, the queries themselves may bring to light unforeseen opportunities and choke points that may give rise to as yet unknown innovations.
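For readers unfamiliar with Top-N pushdown: instead of materializing and fully sorting the result before applying the LIMIT, the engine keeps only the current top N during the scan, and the N-th value acts as a threshold that prunes further work (in Q18, the threshold can even be pushed below the join and aggregation). Here is a Python sketch of just the scan-level version, with made-up data; this is illustrative only, not Orri's Virtuoso implementation:

```python
import heapq
import random

random.seed(3)
orders = [(okey, random.randint(1, 400)) for okey in range(100_000)]  # (orderkey, qty)

# Naive plan: sort the full input, then take the first N rows.
def topn_naive(n=100):
    return sorted(orders, key=lambda r: -r[1])[:n]

# Top-N pushdown at the scan: a bounded min-heap holds the current top N;
# its smallest element is the threshold, so most rows are rejected with a
# single comparison instead of participating in a full sort.
def topn_bounded(n=100):
    heap = []                                # entries are (qty, orderkey)
    for okey, qty in orders:
        if len(heap) < n:
            heapq.heappush(heap, (qty, okey))
        elif qty > heap[0][0]:
            heapq.heapreplace(heap, (qty, okey))
    return sorted(((okey, qty) for qty, okey in heap), key=lambda r: -r[1])

print(topn_naive()[0][1], topn_bounded()[0][1])
```

Both plans return the same top quantities, but the bounded version does O(N log n) work instead of O(N log N), and in a real engine the threshold additionally lets operators below the Top-N skip entire groups.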