In the previous article we saw an unofficial result of running the full workload. Here we will look more closely at the performance profile.

In this article we look at what the server actually does. The execution profiles for all the queries are available for download. To experiment with parallelism, you may download the software and run it locally. An Amazon image may be provided later.

Execution Profile

Below is the top of the oprofile output for a run of the 22 queries with qualification parameters against the 100G database. The TPC-H operation each function corresponds to is noted below its entry.

CPU: Intel Sandy Bridge microarchitecture, speed 2299.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1406009   9.5117  ce_vec_int_range_hash
 
-- Bloom filter check before a selective hash join, where the join is the best or only condition on a scan, e.g., a lineitem scan where l_partkey is checked against a Bloom filter
730935    4.9448  ce_vec_int_sets_hash
 
-- Bloom filter check where another condition is applied first, e.g., lineitem scan with condition on l_shipdate, then Bloom filter check on l_partkey
617091    4.1746  hash_source_chash_input_1i_n
 
-- Q13 right outer hash join with probe from orders, build from customer, NOT EXISTS test in Q16
586938    3.9706  cs_decode
 
-- Generic reading of a column, all queries
536273    3.6279  ce_intd_range_ltgt
 
-- Date range comparison, most queries, e.g., Q1, 3, 4, 5, 6, 7, 8, 20
479898    3.2465  cha_cmp_1i
 
-- Q13 GROUP BY on c_custkey, Q15 GROUP BY on s_suppkey. Indicates cache misses in a high cardinality GROUP BY with a single INT grouping key
473721    3.2047  cha_inline_1i_n_int
 
-- Selective hash join after prefiltering with Bloom filter. Check only that key in hash table, no dependent part. For example Q8, Q9, Q17, Q20, with lineitem filtered by part
463723    3.1371  ce_dict_generic_range_filter
 
-- Range condition on low cardinality column (dictionary encoded), e.g., l_quantity, l_discount
425149    2.8761  cha_inline_1i_int
 
-- Hash join check that a single INT key with a dependent part is in a hash table, fetch the dependent part for hits
365040    2.4695  setp_chash_run
 
-- GROUP BY, e.g., GROUP BY on c_custkey in Q13
359645    2.4330  clrg_partition_dc
 
-- Partitioning a vector of values, occurs in all high cardinality GROUP BYs, e.g., Q13, Q15
349473    2.3642  gb_aggregate
 
-- Updating aggregates after the grouping keys are resolved, e.g., Q1
331926    2.2455  ce_dict_any_sets_decode
 
-- Fetching non-contiguous dictionary encoded strings, e.g., l_returnflag, l_linestatus in Q1
316731    2.1427  cha_insert_1i
 
-- Building a hash join hash table from a 1 INT key to a dependent part, e.g., Q14 from l_partkey to lineitems in a time window with a given l_partkey
286865    1.9406  ce_search_rld
 
-- Index lookup for run length delta compressed keys, e.g., l_orderkey, ps_partkey
231390    1.5654  cha_insert
 
-- Build of a hash join hash table for multipart keys, e.g., hash join with partsupp (Q9) or lineitem (Q17, Q20) on build side
224070    1.5158  ce_dict_int64_sets_decode
 
-- Fetching a non-contiguous set of double column values from dictionary encoding, e.g., l_discount, l_quantity
218506    1.4782  ce_intd_any_sets_decode
 
-- Fetching a non-contiguous set of date column values, e.g., l_shipdate in Q9
200686    1.3576  page_wait_access
 
-- Translating page numbers into buffers in buffer pool for column access
198854    1.3452  itc_col_seg
 
-- Generic part of table scan/index access
197645    1.3371  cha_insert_1i_n
 
-- Hash join build for hash tables with a 1 INT key and no dependent part, e.g., the lineitem-to-part join in Q9, Q17, Q20
195300    1.3212  cha_bloom_unroll_a
 
-- Bloom filter for a hash-based IN predicate
192309    1.3010  hash_source_chash_input
 
-- Hash join probe with multi-part key, e.g., Q9 against partsupp
191313    1.2942  strstr_sse42
 
-- SSE 4.2 substring match, e.g., the NOT LIKE condition in Q13
186325    1.2605  itc_fetch_col_vec
 
-- Translating page numbers into buffers in buffer pool for column access
159198    1.0770  ce_vec_int64_sets_decode
 
-- Fetching non-contiguous 64-bit values from array represented column, e.g., l_extendedprice

So what does TPC-H do? Above all, it is selective hash joins. Next, it is table scans with a condition on a DATE column; the DATE column part is because this implementation does not order the big tables (lineitem, orders) on their DATE columns. Third, TPC-H does big GROUP BYs, with Q13 representing most of this; all the other GROUP BYs have at least 10x fewer groups. Finally, it just extracts column values, most often after selecting the rows on some condition; quite often the condition is that a foreign key column of the row finds a hit in a hash table. This last pattern is called an invisible hash join.
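
To make the invisible hash join concrete, below is a minimal C sketch of the pattern; the names, sizes, and hash function are illustrative, not Virtuoso internals. A vectorized scan applies the cheap DATE range condition first, then checks the surviving foreign keys against a Bloom filter, and probes the hash table only for the rows that pass both.

#include <stdint.h>
#include <stdio.h>

#define VEC 8            /* batch size; the real engine uses vectors of thousands */
#define BLOOM_BITS 1024
#define HASH_SLOTS 64    /* power of two, far larger than the build side */

static uint64_t bloom[BLOOM_BITS / 64];
static int64_t hash_tab[HASH_SLOTS];     /* 0 = empty, else the key itself */

static uint64_t hash_key (int64_t k) { return (uint64_t) k * 0x9E3779B97F4A7C15ull; }

static void hash_insert (int64_t k)
{
  uint64_t h = hash_key (k), slot = h % HASH_SLOTS;
  while (hash_tab[slot]) slot = (slot + 1) % HASH_SLOTS;   /* linear probing */
  hash_tab[slot] = k;
  bloom[(h % BLOOM_BITS) / 64] |= 1ull << (h % 64);        /* set the Bloom bit */
}

static int bloom_maybe (int64_t k)
{
  uint64_t h = hash_key (k);
  return (bloom[(h % BLOOM_BITS) / 64] >> (h % 64)) & 1;
}

static int hash_probe (int64_t k)
{
  for (uint64_t slot = hash_key (k) % HASH_SLOTS; hash_tab[slot]; slot = (slot + 1) % HASH_SLOTS)
    if (hash_tab[slot] == k)
      return 1;
  return 0;
}

int main (void)
{
  /* build side: a few part keys selected by a condition on part */
  int64_t build_keys[3] = { 11, 42, 77 };
  for (int i = 0; i < 3; i++) hash_insert (build_keys[i]);

  /* probe side: one vector of lineitem rows (shipdate as a day number, partkey) */
  int32_t l_shipdate[VEC] = { 9131, 9500, 9200, 8800, 9450, 9300, 9600, 9100 };
  int64_t l_partkey[VEC] = { 11, 42, 13, 42, 77, 99, 11, 55 };
  int32_t lo = 9000, hi = 9365;
  int sel[VEC], n_sel = 0, n_bloom = 0;

  /* 1. cheap date range condition first (cf. ce_intd_range_ltgt) */
  for (int i = 0; i < VEC; i++)
    if (l_shipdate[i] >= lo && l_shipdate[i] < hi)
      sel[n_sel++] = i;

  /* 2. Bloom filter check on the survivors (cf. ce_vec_int_sets_hash) */
  for (int i = 0; i < n_sel; i++)
    if (bloom_maybe (l_partkey[sel[i]]))
      sel[n_bloom++] = sel[i];

  /* 3. full hash probe only for the rows that passed the Bloom filter */
  for (int i = 0; i < n_bloom; i++)
    if (hash_probe (l_partkey[sel[i]]))
      printf ("row %d joins on partkey %lld\n", sel[i], (long long) l_partkey[sel[i]]);
  return 0;
}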

There are also some index lookups, but there the join is usually a merge pattern, like reading orders in order of o_orderkey and then getting the l_orderkey matches from lineitem (e.g., Q21). There are, further, quite frequent partitioning operations, i.e., many threads produce tuples and each tuple goes to a consumer thread selected by a partitioning column of the tuple; this is also called exchange. It occurs in all the high cardinality GROUP BYs and in the RIGHT OUTER JOIN of Q13.
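
As an illustration, here is a minimal sketch of the routing step of an exchange, with hypothetical names: each tuple goes to the consumer picked by a hash of the partitioning column, so all occurrences of a given key land on the same consumer and each consumer can aggregate its slice independently. In the real engine the consumers are separate threads; only the routing is shown.

#include <stdint.h>
#include <stdio.h>

#define N_CONSUMERS 4
#define BATCH 12

int main (void)
{
  int64_t c_custkey[BATCH] = { 101, 202, 303, 101, 404, 202, 505, 303, 101, 606, 404, 202 };
  int64_t part_buf[N_CONSUMERS][BATCH];    /* per-consumer output buffers */
  int part_fill[N_CONSUMERS] = { 0 };

  for (int i = 0; i < BATCH; i++)
    {
      /* the consumer is chosen from the partitioning column alone, so every
         occurrence of a given key goes to the same consumer */
      uint64_t h = (uint64_t) c_custkey[i] * 0x9E3779B97F4A7C15ull;
      int target = (int) (h % N_CONSUMERS);
      part_buf[target][part_fill[target]++] = c_custkey[i];
    }

  for (int c = 0; c < N_CONSUMERS; c++)
    {
      printf ("consumer %d gets:", c);
      for (int i = 0; i < part_fill[c]; i++)
        printf (" %lld", (long long) part_buf[c][i]);
      printf ("\n");
    }
  return 0;
}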

Whether the implementation is RAM-only or pages from disk makes next to no difference. Virtuoso and Actian Vector both have a buffer pool, with Virtuoso using substantially smaller pages (8K vs. 256K or 512K). The page-number-to-buffer translation and the latching that go with it are under 3% (itc_fetch_col_vec, page_wait_access). Of course, actually accessing secondary storage would kill the score, but checking that something is in memory is cheap as long as the check is always a hit.
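
For illustration, a minimal sketch of the hit path of such a translation, with invented names and a toy direct-mapped page map; the miss path, which would go to storage, is omitted because the point is that the in-memory check itself is cheap.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct buffer_s
{
  int64_t page_no;
  int pin_count;
  char *data;          /* the 8K page image in the real system */
} buffer_t;

#define POOL_PAGES 16
static buffer_t *page_map[POOL_PAGES];   /* toy direct-mapped page map */

static buffer_t * page_get (int64_t page_no)
{
  buffer_t *buf = page_map[page_no % POOL_PAGES];
  if (buf && buf->page_no == page_no)
    {
      buf->pin_count++;        /* hit: pin and return, no I/O */
      return buf;
    }
  return NULL;                 /* miss: would read from storage, not shown */
}

int main (void)
{
  buffer_t *b = calloc (1, sizeof (buffer_t));
  b->page_no = 7;
  page_map[7 % POOL_PAGES] = b;
  printf ("page 7: %s\n", page_get (7) ? "hit" : "miss");
  printf ("page 9: %s\n", page_get (9) ? "hit" : "miss");
  free (b);
  return 0;
}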

So, once the query plans are right, the problem resolves into a few different loops. The bulk of the effort in making a TPC-H implementation is in query optimization so that the right loops run in the right order. I will further explain what the loops should contain in the next article.

Many of the functions seen in the profile are instantiations of a template for a specific data type. There are also different templates for variants of a data structure, like a hash table with different keys.

Compilation is a way of generating exactly the right loop for any set of data types. We do not find much need for this here, though. The type-specific operations that are in fact needed can be anticipated and predefined from templates.
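
As an example, a minimal sketch of the template approach with made-up names: one macro expands into a type-specific filter loop per data type, much as the int and int64 variants of the ce_* functions in the profile come from a common template.

#include <stdint.h>
#include <stdio.h>

#define DEF_RANGE_FILTER(name, type)                                     \
  static int name (const type *vals, int n, type lo, type hi, int *sel)  \
  {                                                                      \
    int n_sel = 0;                                                       \
    for (int i = 0; i < n; i++)                                          \
      if (vals[i] >= lo && vals[i] < hi)                                 \
        sel[n_sel++] = i;                                                \
    return n_sel;                                                        \
  }

DEF_RANGE_FILTER (range_filter_int, int32_t)    /* e.g., DATE columns */
DEF_RANGE_FILTER (range_filter_int64, int64_t)  /* e.g., keys, prices */

int main (void)
{
  int32_t dates[5] = { 9100, 9400, 9200, 8800, 9300 };
  int64_t keys[3] = { 5, 42, 7 };
  int sel[5];
  int n = range_filter_int (dates, 5, 9000, 9365, sel);
  printf ("%d of 5 dates in range\n", n);
  n = range_filter_int64 (keys, 3, 10, 100, sel);
  printf ("%d of 3 keys in range\n", n);
  return 0;
}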

Future Gains

There are possible gains in the following domains:

  • Query optimization - Q15 and Q17 do some work in duplicate, so reusing the repeated intermediate result will roughly cut their time in half. In addition to reuse, there is a possibility of converting a scalar subquery into a derived table with a GROUP BY (decorrelation); decorrelation also applies in Q20. Reuse will give some 10-12K more score; decorrelation is probably below measurement noise.

  • Group join - Merging the GROUP BY and the hash probe in Q13 (see the sketch after this list) will save some 2% of the throughput time, maybe 5K more score.

  • Better parallelism in power test - A good power test takes 42s, while 5 times the work (the throughput test) takes 165s rather than the 210s that full utilization would imply, so the platform utilization of the power test is roughly 4/5 (165/210). This could be slightly better.

  • NUMA - Up to 20% performance gains have been seen on scale-out configurations from binding a process to one CPU socket. This gain will be readily seen in a scale-out setting when running one process per physical CPU. Gains in the NUMA department are rather fickle and fragile and are worth less in real life than in benchmarks. So I would say that work in single-server NUMA optimization is not immediately needed since scale-out configurations will get these gains as soon as one sets affinities for processes. But whether binding processes to CPUs makes sense depends on how even the workload is. TPC-H is even, but reality is less so.
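
Here is the group join sketch referred to in the list above, with hypothetical names rather than the actual Q13 plan: the GROUP BY is folded into the hash join probe, so instead of emitting join rows and re-hashing them for grouping, the probe increments a count stored directly in the build-side entry it hits.

#include <stdint.h>
#include <stdio.h>

#define SLOTS 16   /* power of two, large enough for the demo keys */

typedef struct
{
  int64_t c_custkey;    /* 0 = empty slot */
  int64_t order_count;  /* aggregate kept directly in the hash entry */
} group_entry_t;

static group_entry_t tab[SLOTS];

static group_entry_t * find_slot (int64_t key)
{
  uint64_t slot = ((uint64_t) key * 0x9E3779B97F4A7C15ull) % SLOTS;
  while (tab[slot].c_custkey && tab[slot].c_custkey != key)
    slot = (slot + 1) % SLOTS;             /* linear probing */
  return &tab[slot];
}

int main (void)
{
  /* build side: customers, with the aggregate starting at 0 */
  int64_t customers[4] = { 1, 2, 3, 4 };
  for (int i = 0; i < 4; i++)
    find_slot (customers[i])->c_custkey = customers[i];

  /* probe side: o_custkey of qualifying orders; each hit updates the
     aggregate in place, so no separate GROUP BY pass is needed */
  int64_t o_custkey[7] = { 2, 2, 4, 1, 2, 4, 4 };
  for (int i = 0; i < 7; i++)
    {
      group_entry_t *e = find_slot (o_custkey[i]);
      if (e->c_custkey)
        e->order_count++;
    }

  for (int i = 0; i < 4; i++)
    printf ("custkey %lld: %lld orders\n", (long long) customers[i],
            (long long) find_slot (customers[i])->order_count);
  return 0;
}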

In principle, a composite score of a good run could go from 250K to 280K by diverse incremental improvements. There is some noise in run times, so two consecutive runs can deviate by a few percent, which is already significant when talking of small improvements.

Larger Scales

For a 100 GB run with 5 throughput streams, the peak query memory consumption is 3.5 GB. This includes all the GROUP BY and hash join tables that are made. Q13 is the most expensive in terms of space and allocates a peak of 1 GB for a single execution. This is trivial in comparison with the working set. But at 100x greater scale (10,000 GB, or 10 TB) this becomes about 630 GB (3.5 GB × 100 × 9/5), as there will be a minimum of 9 concurrent streams instead of 5.

Now 10 TB is clearly a scale-out size, so the 630 GB transient memory is not on a single machine.

Still, going to large scale-out will change the nature of some queries and introduce significant data transfer. We will know soon enough.

Problems of the Metric

A small-scale TPC-H, e.g., 100 GB, starts to have features of a lookup workload. This means that there is high variability between consecutive executions, and that the pre-execution state of the system has a large effect on the measurement.

The rules say that the power test follows the bulk load, with perhaps some checking of the correctness of the load in between. The bulk load is basically unregulated and will usually include statistics gathering.

The first power test shows significant variation, anything from 220K to 240K, while the second power test is steadily around 255K. Since the reported score is the lower of the two, the biggest return to the implementor is in making sure the first power test is good.

The second throughput test is usually 5-10K higher than the first; the throughput test is less sensitive. The difference does not come from I/O, but from system time for memory allocation: the memory, in the same quantities and block sizes, is reused by the second run.

A good power run is 42s from 100% warm RAM. A power run that gets the data from the OS disk buffers is 70s or so. A power run that gets data from SSD is worse, maybe 120s.

To cite an example, increasing the buffer pool size from 64 GB to 72 GB gets the first post-load power test from 120-150K to 230-240K, while having no appreciable effect on subsequent tests. The effect is exacerbated by the fact that the power score is based on a geometric mean of run times. Very short queries (e.g., Q2) vary between consecutive in-memory executions from 120ms to 220ms; a similar variation occurs in Q13, which lands on either side of 6s. Because of the geometric mean, the same absolute variability has a very different impact depending on which query it hits. A Q2 that reads data from out of process can take 2s instead of the expected under-200ms, and this kills the score even though a delay of 1.8s as such would not. So increasing the buffer pool in the example just serves to make sure the small supplier table is in memory; fetching it from the OS is simply not an option in the first Q2, even if it were an option in a longer-running query. Remember, the lower of the two scores is reported, and the first power test will be bad unless it is somehow primed by some trick like bulk load order.
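
To put rough numbers on this, assuming the standard power metric, i.e., 3600 × scale factor divided by the geometric mean of the 22 query times and the 2 refresh times (24 items in all): a Q2 that slips from 0.2s to 2s multiplies one of the 24 items by 10, so the geometric mean grows by 10^(1/24) ≈ 1.10 and the power score drops by roughly 9%; the same 1.8s added to a 6s Q13 multiplies its item by only 1.3, so the geometric mean grows by 1.3^(1/24) ≈ 1.01, about a 1% loss.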

When differences between implementations are small, variation between consecutive runs becomes important. This is why OLTP benchmarks are required to run for a relatively long time and only measure the steady state portion. This would also be appropriate for small TPC-H runs.

Conclusions

Query performance reduces to a few loops when all the conditions are right. Getting the conditions right depends on query optimization and the right architecture choices. These results would be unthinkable without vectored execution and a compressed column store design, but these in and of themselves guarantee nothing unless all the plans are right. An earlier Virtuoso 7 takes over 10x longer because of bad plans and because it lacks some key execution techniques, like the RIGHT OUTER JOIN in Q13 and the partitioned GROUP BY.

Last summer, we did runs of the much simpler Star Schema Benchmark. The results were very good, because the engine had the much smaller set of tricks that workload needs, plus the right plans. Repeating those tests now would show some gains from a still better hash join, but nothing dramatic.

In the next article we will look at the finer points of hash join. After this we move to larger scales and clusters.

To be continued...

In Hoc Signo Vinces (TPC-H) Series