There is a new Semantic Publishing Benchmark (SPB) record with Virtuoso at the 256 Mtriple scale.

As before, the result was measured with the feature/analytics branch of the v7fasttrack open-source distribution. An updated benchmarks AMI with this version of the software, preconfigured as an Amazon EC2 image, will be available within the next week and announced on this blog.

On the Cost of RDF Query Optimization

RDF query optimization is harder than the relational equivalent: first, because there are more joins, hence a combinatorial explosion of the plan search space (join ordering is NP-hard); and second, because cardinality estimation is harder and usually less reliable. The work on characteristic sets, pioneered by Thomas Neumann in RDF3X, uses regularities in the data's structure to treat properties that usually occur on the same subject as columns of a table. The same idea is applied to tuning the physical representation in the joint Virtuoso / MonetDB work published at WWW 2015.
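To make the idea concrete, here is a minimal sketch in Python, with made-up data, of what computing characteristic sets amounts to: subjects are grouped by the set of properties they carry, and each distinct property set can then back a wider, relational-style table. This illustrates the concept only, not RDF3X's or Virtuoso's actual implementation.

    from collections import defaultdict

    def characteristic_sets(triples):
        # Group subjects by the set of properties they carry; each distinct
        # property set is one characteristic set, and its subjects can be
        # stored as rows of a wider, relational-style table.
        props_by_subject = defaultdict(set)
        for s, p, o in triples:
            props_by_subject[s].add(p)
        tables = defaultdict(list)
        for s, props in props_by_subject.items():
            tables[frozenset(props)].append(s)
        return tables

    triples = [
        ("ex:article1", "dc:title", "A"), ("ex:article1", "dc:creator", "B"),
        ("ex:article2", "dc:title", "C"), ("ex:article2", "dc:creator", "D"),
        ("ex:image1",   "dc:title", "E"),
    ]
    for props, subjects in characteristic_sets(triples).items():
        print(sorted(props), "->", subjects)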

The Virtuoso results discussed here, however, are all based on a single RDF quad table with Virtuoso's default index configuration.

Introducing query plan caching raises the Virtuoso score from 80 qps to 144 qps at the 256 Mtriple scale. The SPB queries are not extremely complex; lookups with many more triple patterns exist in actual workloads, e.g., Open PHACTS, and in such applications query optimization dominates execution time all the more. In SPB, the data volumes touched by queries grow near-linearly with data scale. At the 256 Mtriple scale, nearly half of the CPU cycles are spent deciding a query plan. Below are the CPU times for compilation and execution per query type, sorted by descending sum of the times and scaled to milliseconds per execution; they are taken from a one-minute sample of running at full throughput.
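Plan caching helps because the expensive plan search then runs once per distinct parameterized query rather than once per execution. A minimal sketch of the idea in Python, with hypothetical optimize and executor stand-ins rather than any actual Virtuoso interface:

    from functools import lru_cache

    # Hypothetical stand-ins; not a Virtuoso interface. The point is only
    # that the expensive plan search runs once per distinct query text.
    def optimize(query_text):
        # cost model, join-order search, etc. -- the expensive part
        return ("plan", query_text)

    @lru_cache(maxsize=1024)
    def plan_for(query_text):
        return optimize(query_text)

    def run(query_text, executor):
        # Repeated executions of the same query reuse the cached plan.
        return executor(plan_for(query_text))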

The test system is the same as used before in the TPC-H series: dual Xeon E5-2630 (Sandy Bridge), 2 x 6 cores x 2 threads at 2.3 GHz, 192 GB RAM.

We measure compile and execute times, with and without considering hash join. When hash join is considered, the throughput is 80 qps; when it is not, 110 qps. With query plan caching, the throughput is 145 qps whether or not hash join is considered. Using hash join makes little difference to this workload, but considering it during query optimization costs significant extra work.
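The per-query numbers below could be produced roughly as follows: accumulate compile and execute time per query type over the sample and divide by the number of executions. The sample format here is an assumption for illustration, not the actual profiling tooling.

    from collections import defaultdict

    def per_query_breakdown(samples):
        # samples: iterable of (query_type, compile_ms, execute_ms), one tuple
        # per observed execution in the sample window (format assumed).
        totals = defaultdict(lambda: [0.0, 0.0, 0])
        for qtype, compile_ms, execute_ms in samples:
            t = totals[qtype]
            t[0] += compile_ms
            t[1] += execute_ms
            t[2] += 1
        rows = [(c / n, e / n, (c + e) / n, qtype)
                for qtype, (c, e, n) in totals.items()]
        # Descending sum of compile + execute per execution, as in the tables.
        return sorted(rows, key=lambda r: r[2], reverse=True)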

With hash join

Compile    Execute    Total      Query
3156 ms    1181 ms    4337 ms    Total
1327 ms      28 ms    1355 ms    query 01
 444 ms     460 ms     904 ms    query 08
 466 ms      54 ms     520 ms    query 06
 123 ms     268 ms     391 ms    query 05
 257 ms       5 ms     262 ms    query 11
 191 ms      59 ms     250 ms    query 10
   9 ms     179 ms     188 ms    query 04
 114 ms      26 ms     140 ms    query 07
  46 ms      62 ms     108 ms    query 09
  71 ms      25 ms      96 ms    query 12
  61 ms      13 ms      74 ms    query 03
  47 ms       2 ms      49 ms    query 02

Without hash join

Compile    Execute    Total      Query
1816 ms    1019 ms    2835 ms    Total
 197 ms     466 ms     663 ms    query 08
 609 ms      32 ms     641 ms    query 01
 188 ms     293 ms     481 ms    query 05
 275 ms      61 ms     336 ms    query 09
 163 ms      10 ms     173 ms    query 03
 128 ms      38 ms     166 ms    query 10
 102 ms       5 ms     107 ms    query 11
  63 ms      27 ms      90 ms    query 12
  24 ms      57 ms      81 ms    query 06
  47 ms       1 ms      48 ms    query 02
  15 ms      24 ms      39 ms    query 07
   5 ms       5 ms      10 ms    query 04

Considering hash join always slows down compilation, and sometimes improves and sometimes worsens execution. Some improvement in the cost model and in the order of plan-space traversal is possible, but removing compilation altogether via caching is better still. The results are as expected, since a lookup workload such as SPB has little use for hash join by nature.

The rationale for considering hash join in the first place is that analytical workloads rely heavily on it. A good TPC-H score is simply infeasible without hash join, as previously discussed on this blog. If RDF is to be a serious contender beyond serving lookups, then hash join is indispensable. The decision to use it, however, depends on accurate cardinality estimates on either side of the join.
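A toy cost comparison in Python, with made-up constants, shows why the estimates matter: hash join pays to build a table on one side in order to avoid per-row index probes from the other, so the choice flips with how large the two inputs are believed to be.

    def pick_join(est_driving_rows, est_other_rows,
                  index_probe_cost=6.0, hash_build_cost=1.5, hash_scan_cost=1.0):
        # Index nested-loop: one index probe into the other table per driving row.
        index_cost = est_driving_rows * index_probe_cost
        # Hash join: build a hash table on the driving side, then scan and
        # probe with the other side.
        hash_cost = (est_driving_rows * hash_build_cost
                     + est_other_rows * hash_scan_cost)
        return "hash join" if hash_cost < index_cost else "index nested-loop join"

    print(pick_join(50, 1_000_000))       # lookup-style: index nested-loop join
    print(pick_join(800_000, 1_000_000))  # analytic-style: hash join

If the driving-side estimate is off by a couple of orders of magnitude, the decision flips; that is why the estimates on either side of the join have to be trustworthy.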

Previous work (e.g., papers from FORTH around MonetDB) advocates doing away with a cost model altogether, since one is hard to build and unreliable with RDF anyway. The idea is not without its attraction, but it leads to missing out on analytics or to relying on query hints for hash join.
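What a rule-based planner in that spirit might look like: order the triple patterns purely by how many of their positions are bound, with no statistics consulted at all. The sketch below is only an illustration of the style; the prefixes and the scoring rule are made up, not any particular system's heuristics.

    def boundness(pattern):
        # Count constant positions; variables are written "?x" here.
        return sum(1 for term in pattern if not term.startswith("?"))

    def rule_based_order(patterns):
        # Most constants first; no statistics consulted at all.
        return sorted(patterns, key=boundness, reverse=True)

    bgp = [
        ("?work", "cwork:about", "?thing"),
        ("?work", "rdf:type", "cwork:BlogPost"),
        ("?work", "cwork:dateModified", "?date"),
    ]
    print(rule_based_order(bgp))

Heuristics like this are cheap, but they give the planner no basis for deciding when a hash join pays off.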

The present Virtuoso thinking is that rule-based optimization is not the preferred solution; the preference is rather to use characteristic sets to reduce triples into wider tables, which also cuts down the plan search space and makes cost estimation more reliable.
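The plan-search part of that argument is easy to quantify: a star-shaped lookup touching n properties is an n-way self-join over the quad table, while the same lookup over a matching characteristic-set table is a single access. A back-of-the-envelope sketch in Python, counting left-deep join orders only:

    from math import factorial

    # Left-deep join orders over n triple patterns grow as n!; bushy plans
    # grow faster still. Folding a star of n properties into one
    # characteristic-set table turns the n-way self-join into a single access.
    for n in (4, 8, 12):
        print(n, "patterns:", factorial(n), "left-deep join orders")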

When looking at execution alone, we see that actual database operations rank low in the profile, with memory management at the top at 19%. This is due to CONSTRUCT queries allocating small blocks for the graphs they return, which is entirely avoidable.