OpenLink Virtuoso (Product Blog)

There is a new SPB (Semantic Publishing Benchmark) 256 Mtriple record with Virtuoso.

As before, the result was measured with the feature/analytics branch of the v7fasttrack open source distribution, and it will soon be available as a preconfigured Amazon EC2 image. The updated benchmarks AMI with this version of the software will be released within the next week and announced on this blog.

On the Cost of RDF Query Optimization

RDF query optimization is harder than the relational equivalent: first, because there are more joins, hence a combinatorial explosion of the plan search space; and second, because cardinality estimation is harder and usually less reliable. The work on characteristic sets, pioneered by Thomas Neumann in RDF-3X, exploits regularities in structure to treat properties that usually occur together on the same subject as columns of a table. The same idea is applied to tuning the physical representation in the joint Virtuoso / MonetDB work published at WWW 2015.

The Virtuoso results discussed here, however, are all based on a single RDF quad table with Virtuoso's default index configuration.
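With a single quad table, every triple pattern in a query is one more join against that table, so the number of candidate join orders grows quickly with the pattern count. As a rough illustration (a counting sketch only, not how Virtuoso enumerates plans), the following computes the number of left-deep join orders and of bushy join trees for n inputs. These counts include orders that a real optimizer would prune, such as those requiring cross products.

```python
from math import comb, factorial


def left_deep_orders(n):
    """Number of left-deep join orders over n inputs: n!."""
    return factorial(n)


def bushy_trees(n):
    """Number of bushy join trees over n inputs.

    This is n! leaf orderings times the Catalan number C(n-1),
    which counts the binary tree shapes with n leaves.
    """
    catalan = comb(2 * (n - 1), n - 1) // n
    return factorial(n) * catalan


if __name__ == "__main__":
    for n in (2, 4, 8, 12):
        print(n, left_deep_orders(n), bushy_trees(n))
```

Even before costing each candidate, the enumeration alone becomes expensive for queries with a dozen or more triple patterns.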

Introducing query plan caching raises the Virtuoso score from 80 qps to 144 qps at the 256 Mtriple scale. The SPB queries are not extremely complex; lookups with many more triple patterns exist in actual workloads, e.g., Open PHACTS. In such applications, query optimization indeed dominates execution times. In SPB, data volumes touched by queries grow nearly linearly with data scale. At the 256 Mtriple scale, nearly half of CPU cycles are spent deciding a query plan. Below are the CPU cycles for execution and compilation per query type, sorted by descending sum of the times, scaled to milliseconds per execution. These are taken from a one-minute sample of running at full throughput.

The test system is the same one used before in the TPC-H series: dual Xeon E5-2630 (Sandy Bridge), 2 x 6 cores x 2 threads, 2.3 GHz, 192 GB RAM.

We measure the compile and execute times, with and without using hash join. When considering hash join, the throughput is 80 qps. When not considering hash join, the throughput is 110 qps. With query plan caching, the throughput is 145 qps whether or not hash join is considered. Using hash join is not significant for the workload but considering its use in query optimization leads to significant extra work.
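To make the idea of plan caching concrete, here is a minimal sketch of one common way to do it: replace the literals in the query text with placeholders, and use the resulting template as the key of an LRU cache of compiled plans. This is an illustration under stated assumptions, not Virtuoso's implementation. The names `parameterize`, `PlanCache`, and `compile_plan` are hypothetical, and only double-quoted string literals are parameterized here for brevity.

```python
import re
from collections import OrderedDict

# Matches double-quoted string literals, allowing backslash escapes.
# (Simplification: IRIs and numeric constants are left in the template.)
_STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def parameterize(query):
    """Split a query into a literal-free template and its literals."""
    params = []

    def _swap(match):
        params.append(match.group(0))
        return f"${len(params) - 1}"

    template = _STRING_LITERAL.sub(_swap, query)
    # Collapse whitespace so formatting differences share one cache entry.
    return " ".join(template.split()), params


class PlanCache:
    """LRU cache from query template to compiled plan (hypothetical)."""

    def __init__(self, compile_plan, capacity=256):
        self._compile_plan = compile_plan  # the expensive optimizer call
        self._capacity = capacity
        self._plans = OrderedDict()
        self.compilations = 0

    def get_plan(self, query):
        template, params = parameterize(query)
        if template in self._plans:
            self._plans.move_to_end(template)  # mark as recently used
        else:
            self._plans[template] = self._compile_plan(template)
            self.compilations += 1
            if len(self._plans) > self._capacity:
                self._plans.popitem(last=False)  # evict least recently used
        return self._plans[template], params
```

The trade-off of this design is that a plan compiled for one set of constants is reused for all others, so it may not be the best plan for constants with very different selectivity.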

With hash join

Compile	Execute	Total	Query
`3156 ms`	`1181 ms`	`4337 ms`	`Total`
`1327 ms`	`28 ms`	`1355 ms`	`query 01`
`444 ms`	`460 ms`	`904 ms`	`query 08`
`466 ms`	`54 ms`	`520 ms`	`query 06`
`123 ms`	`268 ms`	`391 ms`	`query 05`
`257 ms`	`5 ms`	`262 ms`	`query 11`
`191 ms`	`59 ms`	`250 ms`	`query 10`
`9 ms`	`179 ms`	`188 ms`	`query 04`
`114 ms`	`26 ms`	`140 ms`	`query 07`
`46 ms`	`62 ms`	`108 ms`	`query 09`
`71 ms`	`25 ms`	`96 ms`	`query 12`
`61 ms`	`13 ms`	`74 ms`	`query 03`
`47 ms`	`2 ms`	`49 ms`	`query 02`

Without hash join

Compile	Execute	Total	Query
`1816 ms`	`1019 ms`	`2835 ms`	`Total`
`197 ms`	`466 ms`	`663 ms`	`query 08`
`609 ms`	`32 ms`	`641 ms`	`query 01`
`188 ms`	`293 ms`	`481 ms`	`query 05`
`275 ms`	`61 ms`	`336 ms`	`query 09`
`163 ms`	`10 ms`	`173 ms`	`query 03`
`128 ms`	`38 ms`	`166 ms`	`query 10`
`102 ms`	`5 ms`	`107 ms`	`query 11`
`63 ms`	`27 ms`	`90 ms`	`query 12`
`24 ms`	`57 ms`	`81 ms`	`query 06`
`47 ms`	`1 ms`	`48 ms`	`query 02`
`15 ms`	`24 ms`	`39 ms`	`query 07`
`5 ms`	`5 ms`	`10 ms`	`query 04`

Considering hash join slows down compilation overall (3156 ms versus 1816 ms in total), and sometimes improves and sometimes worsens execution. Some improvement in the cost model and in plan-space traversal order is possible, but removing compilation altogether via caching is better still. The results are as expected, since a lookup workload such as SPB has little use for hash join by nature.
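The share of time spent compiling, and the queries whose execution got faster when hash join was considered, can be read off the two tables with a few lines of arithmetic. The figures below are copied from the tables; the variable and function names are only for illustration.

```python
# (compile ms, execute ms) per query, copied from the tables above.
WITH_HASH_JOIN = {
    "01": (1327, 28), "02": (47, 2), "03": (61, 13), "04": (9, 179),
    "05": (123, 268), "06": (466, 54), "07": (114, 26), "08": (444, 460),
    "09": (46, 62), "10": (191, 59), "11": (257, 5), "12": (71, 25),
}
WITHOUT_HASH_JOIN = {
    "01": (609, 32), "02": (47, 1), "03": (163, 10), "04": (5, 5),
    "05": (188, 293), "06": (24, 57), "07": (15, 24), "08": (197, 466),
    "09": (275, 61), "10": (128, 38), "11": (102, 5), "12": (63, 27),
}


def compile_share(rows):
    """Fraction of total (compile + execute) time spent compiling."""
    compile_ms = sum(c for c, _ in rows.values())
    execute_ms = sum(e for _, e in rows.values())
    return compile_ms / (compile_ms + execute_ms)


def faster_execution_with_hash_join():
    """Queries whose execute time is lower when hash join is considered."""
    return sorted(
        q for q in WITH_HASH_JOIN
        if WITH_HASH_JOIN[q][1] < WITHOUT_HASH_JOIN[q][1]
    )


if __name__ == "__main__":
    print(f"compile share with hash join:    {compile_share(WITH_HASH_JOIN):.1%}")
    print(f"compile share without hash join: {compile_share(WITHOUT_HASH_JOIN):.1%}")
    print("faster execution with hash join:", faster_execution_with_hash_join())
```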

The rationale for considering hash join in the first place is that analytical workloads rely heavily on it. A good TPC-H score is simply infeasible without it, as previously discussed on this blog. If RDF is to be a serious contender beyond serving lookups, then hash join is indispensable. The decision to use it, however, depends on accurate cardinality estimates on either side of the join.
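Why the decision hinges on cardinality estimates can be seen with a toy cost comparison. The sketch below is purely illustrative: the cost formulas and the constants `build_cost` and `probe_cost` are assumptions chosen for the example, not Virtuoso's cost model. An index lookup join pays one logarithmic probe per outer row, while a hash join pays once to build a table over the inner side and then a cheap probe per outer row.

```python
from math import log2


def index_lookup_cost(outer_rows, inner_rows):
    """One index probe per outer row, each roughly log2(inner size)."""
    return outer_rows * log2(max(inner_rows, 2))


def hash_join_cost(outer_rows, inner_rows, build_cost=2.0, probe_cost=1.0):
    """Build a hash table over the inner side once, then probe per outer row.

    build_cost and probe_cost are made-up unit costs for illustration.
    """
    return inner_rows * build_cost + outer_rows * probe_cost


def choose_join(outer_rows, inner_rows):
    """Pick the cheaper strategy under the toy cost model above."""
    if hash_join_cost(outer_rows, inner_rows) < index_lookup_cost(outer_rows, inner_rows):
        return "hash join"
    return "index lookup"


if __name__ == "__main__":
    # A selective lookup: few outer rows against a large inner side.
    print(choose_join(10, 1_000_000))
    # An analytical join: both sides are large.
    print(choose_join(1_000_000, 1_000_000))
```

If the outer side is estimated at 10 rows but actually produces a million, the model picks the index lookup and pays for a million probes, which is the kind of mistake that unreliable RDF cardinality estimates make likely.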

Previous work (e.g., papers from FORTH around MonetDB) advocates doing away with a cost model altogether, since one is hard to build and unreliable with RDF anyway. The idea is not without its attraction, but it will lead either to missing out on analytics or to relying on query hints for hash join.

The present Virtuoso thinking is that moving to rule-based optimization is not the preferred solution. The preferred approach is to use characteristic sets to reduce triples into wider tables, which also cuts down the plan search space and increases the reliability of cost estimation.
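As a minimal sketch of the characteristic-set idea, the code below groups subjects by the exact set of properties they carry and then lays out one group as a wide table with one column per property. The function names and the sample data are hypothetical, and real systems also merge near-identical sets and handle storage details that are omitted here.

```python
from collections import defaultdict


def characteristic_sets(triples):
    """Map each distinct property set to the subjects that have exactly it."""
    props_by_subject = defaultdict(set)
    for subject, prop, _obj in triples:
        props_by_subject[subject].add(prop)

    subjects_by_cset = defaultdict(list)
    for subject, props in props_by_subject.items():
        subjects_by_cset[frozenset(props)].append(subject)
    return dict(subjects_by_cset)


def widen(triples, subjects, columns):
    """Build one wide row per subject, with a list of values per column."""
    wanted = set(subjects)
    rows = {s: {c: [] for c in columns} for s in subjects}
    for subject, prop, obj in triples:
        if subject in wanted:
            rows[subject][prop].append(obj)
    return rows


if __name__ == "__main__":
    # Made-up sample data for illustration.
    triples = [
        ("work1", "title", "A"), ("work1", "date", "2015"),
        ("work2", "title", "B"), ("work2", "date", "2014"),
        ("alice", "name", "Alice"),
    ]
    csets = characteristic_sets(triples)
    works = csets[frozenset({"title", "date"})]
    print(widen(triples, works, ["title", "date"]))
```

A query that asks for the title and date of a work then touches one row of the wide table instead of joining two triple patterns, which is where the reduction in plan search space comes from.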

When looking at execution alone, we see that actual database operations are low in the profile, with memory management taking the top 19%. This is due to CONSTRUCT queries allocating small blocks for returning graphs, which is entirely avoidable.