In Hoc Signo Vinces (part 15 of n): TPC-H and the Science of Hash [ Orri Erling ]

This piece is dedicated to Peter Boncz, architect of Actian Vector and MonetDB.

Query optimization is hard. It is a set of mutually interacting tricks and special cases. Execution is also hard, but there the tricks do not interact quite as much or as unpredictably. So, if there are a few percent of score to be had from optimizing either execution or query plans, I will take execution first. It is less likely to break things and will probably benefit a larger set of use cases.

As we see from the profile in the previous article, hash join is the main piece of execution in TPC-H. So between the article on late projection and the first result preview, I changed the hash table used in HASH JOIN and GROUP BY from cuckoo to linear.

Let's see how the hash tables work: Cuckoo hash is a scheme where an entry can be in one of two possible places in the table. If a new entry is inserted and either of the possible places is unoccupied, it goes there. If both are occupied, it could be that one contains an entry whose other possible location is free -- and then that entry may be relocated. Thus an insert may push the previous occupant of the place somewhere else, which in turn may push another, and so on. It may happen that an insert is still not possible, in which case the entry to insert goes into an exceptions list.

To look up an entry, you get a hash number, and use different fields of it to pick the two places. Look in one, then the other, then the exceptions. If there is no match and the table is reasonably close to capacity, you will have looked in at least 3 widely-separated places to determine the absence of a match. In practice, the hash table consists of a prime number of distinct arrays of a fixed size (partitions), and each partition has its own exception list. A modulo of the hash number picks the array, then two further modulos of different parts of the number pick the places in the array.
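
As a sketch, the lookup path for the single-integer case could look like the following; the names and the choice of hash fields are illustrative, not the Virtuoso source.

#include <stdint.h>
#include <stddef.h>

/* One partition of the table: a fixed-size array plus its own
   exception list for entries that could not be placed. */
typedef struct cuckoo_part_s
{
  int64_t *entries;      /* part_size slots; 0 = empty */
  int64_t *exceptions;   /* overflow entries */
  int      n_exceptions;
} cuckoo_part_t;

/* Look up a single-integer key given its hash number h: at most two
   array probes plus the exception list, i.e., up to 3 widely
   separated memory accesses. */
static int64_t *
cuckoo_get (cuckoo_part_t *parts, int n_parts, int part_size,
            int64_t key, uint64_t h)
{
  cuckoo_part_t *part = &parts[h % n_parts];            /* pick the partition */
  int64_t *p1 = &part->entries[(h >> 8) % part_size];   /* first place */
  int64_t *p2 = &part->entries[(h >> 32) % part_size];  /* second place */
  if (*p1 == key)
    return p1;
  if (*p2 == key)
    return p2;
  for (int i = 0; i < part->n_exceptions; i++)          /* last resort */
    if (part->exceptions[i] == key)
      return &part->exceptions[i];
  return NULL;
}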

In most cases in TPC-H, the hash joins are selective; i.e., most items on the probe side find no match in the hash table.

So, quite often you have 3 cache misses to show that there is no hit. This is, at least in theory, quite bad.

There are Bloom filters before the hash table. The Bloom filter will prune most of the probes that would miss. A Bloom filter is an array of bits. Given a hash number, the Bloom filter will very efficiently tell you whether the entry is sure not to be in the hash table. If the Bloom filter says it can be in the hash table, you must look.

In the Virtuoso case, for each entry in the hash table, the Bloom filter has 8 bits. The Bloom check uses a field of the hash number to pick a 64-bit word from the Bloom filter. Then different fields of the hash number are used to set up to 4 bits in a 64-bit bit-mask. When building the hash table, the masks are OR-ed into the Bloom filter. When probing, before looking in the hash table, the system checks to see if the bits corresponding to the hash number are all on in the appropriate word. If they are not, the hash lookup is sure to miss.
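
The following sketch shows one way the word and mask selection could be coded; the BF_WORD and BF_MASK macros Virtuoso actually uses appear in the loop further down, and the bit-field choices here are assumptions.

#include <stdint.h>

/* One field of the hash picks the 64-bit word; sz is the filter
   size in words. */
#define BF_WORD_SK(h, sz)  ((uint32_t)(h) % (sz))

/* Four 6-bit fields of the high half of the hash each select a bit,
   so between 1 and 4 distinct bits end up set in the mask. */
#define BF_MASK_SK(h) \
  ((1ULL << (((h) >> 32) & 63)) | (1ULL << (((h) >> 38) & 63)) | \
   (1ULL << (((h) >> 44) & 63)) | (1ULL << (((h) >> 50) & 63)))

/* Build side: OR the mask into the selected word for each entry. */
static inline void
bf_add (uint64_t *bf, uint64_t sz, uint64_t h)
{
  bf[BF_WORD_SK (h, sz)] |= BF_MASK_SK (h);
}

/* Probe side: all mask bits on means "maybe in the hash table";
   anything else is a sure miss. */
static inline int
bf_maybe (uint64_t *bf, uint64_t sz, uint64_t h)
{
  uint64_t mask = BF_MASK_SK (h);
  return (bf[BF_WORD_SK (h, sz)] & mask) == mask;
}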

Most expositions of Bloom filters talk about setting two bits for every value. With two bits set, we found 8 bits-per-value to work best. More bits make a larger filter that misses the cache more often; fewer bits make too many collisions, and the Bloom filter produces too many false positives. A significant finding is that with 8 bits-per-value, setting 4 bits instead of 2 makes the filter twice as selective. The simple trick of setting 4 bits cuts the number of hash lookups for items that passed the Bloom filter in half in many selective hash joins. Examples are the many joins of lineitem with part or supplier where there is a condition on the smaller table.
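
This agrees with the standard false-positive estimate for a Bloom filter: with b bits per value and k bits set, the expected false-positive rate is roughly (1 - e^(-k/b))^k. For b = 8, k = 2 gives (1 - e^(-0.25))^2, about 0.049, while k = 4 gives (1 - e^(-0.5))^4, about 0.024: almost exactly twice as selective, matching the observation above.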

Still, even with Bloom filters, a cuckoo hash will make too many cache misses.

So, enter linear hash. The idea is simple: The hash number picks a place in an array. Either the entry being sought is in the vicinity, or it is not in the hash table. If the vicinity is full of other entries, the entry can still be in an exception list.

With this and cuckoo alike, there are 3 different variants of hash table:

  1. A set of single unique integers

  2. A single-integer key with 0 or more dependent values, possibly with a next link if the key is not unique

  3. A key of n arbitrary values, 0 or more dependent values, optional next link if the key is not unique

In the first case, the hash table is an array of values; in the two other cases, it is an array of pointers. But since a pointer is 64 bits, of which the high 16 are not in the address space of x86_64, these high bits can be used to keep a part of the hash number. It will be necessary to dereference the pointer only if the high bits match the hash number. This means that nearly all lookups that do not find a match are handled with a single cache miss.

Each cache miss brings in a cache line of 8 words. The lookup starts at a point given by the hash number and wraps around at the end of the cache line. Only in the case that all 8 words are occupied but do not match does one need to look at the exceptions. There is one exception list for each partition of the hash table, like in the cuckoo scheme.
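
Putting the tagged pointers and the cache-line-bounded search together, a probe might look like this sketch; the slot layout and field choices are assumptions, not the Virtuoso code.

#include <stdint.h>
#include <stddef.h>

#define WORDS_PER_LINE 8   /* 64-byte cache line of 8-byte words */

/* One partition: an array of tagged pointers. The high 16 bits of a
   slot hold bits of the entry's hash; the low 48 bits are the x86_64
   address of the entry. 0 means an empty slot. */
typedef struct linear_part_s
{
  uint64_t *slots;
  uint64_t  n_slots;   /* multiple of WORDS_PER_LINE */
} linear_part_t;

static void *
linear_get (linear_part_t *part, uint64_t h, int64_t key)
{
  uint64_t pos = h % part->n_slots;
  uint64_t line = pos & ~(uint64_t)(WORDS_PER_LINE - 1);  /* line start */
  uint16_t tag = (uint16_t)(h >> 48);
  for (int i = 0; i < WORDS_PER_LINE; i++)
    {
      /* start at the hash position, wrap at the end of the cache line */
      uint64_t slot = part->slots[line + ((pos + i) & (WORDS_PER_LINE - 1))];
      if (!slot)
        return NULL;               /* empty slot: sure miss */
      if ((uint16_t)(slot >> 48) != tag)
        continue;                  /* high bits differ: no dereference */
      int64_t *entry = (int64_t *)(uintptr_t)(slot & ((1ULL << 48) - 1));
      if (*entry == key)           /* dereference only on a tag match */
        return entry;
    }
  /* all 8 words occupied by other keys: the entry can still be in the
     partition's exception list (omitted here) */
  return NULL;
}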

A hash lookup is always done on a vector of keys; the loop that takes most of the time is in fact the Bloom filter check. It goes as follows:

/* Declare the hash number, Bloom word, and mask for lane n. */
#define CHB_VAR(n)                           \
   uint64 h##n, w##n, mask##n;


/* Hash the key, fetch the Bloom filter word it selects, and compute
   the up-to-4-bit mask from other fields of the hash number. */
#define CHB_INIT(n, i)                       \
   MHASH_STEP_1 (h##n, i);                   \
   w##n = bf[BF_WORD (h##n, sz)];            \
   mask##n = BF_MASK (h##n);


/* Predicated check: store the candidate index unconditionally and
   advance the fill only if all the mask bits are on in the word. */
#define CHB_CK(n)                            \
   { matches[mfill] = inx + n;               \
     mfill += (w##n & mask##n) == mask##n; }

       for (; inx < last; inx++)
         {
           CHB_VAR (0);
           CHB_INIT (0, REF ((ce_first + sizeof (ELT_T) * inx)));
           CHB_CK (0);
         }

This is the perfect loop for out-of-order execution. Now, I have tried every variation you can imagine, and this does not get better. The loop calculates a hash number, fetches the corresponding word from the Bloom filter, calculates a mask, stores the index of the key in a results array, and increments the results counter if all the bits were set. There is no control dependency anywhere, just a data dependency between successive iterations; i.e., to know where the result must go, you must know if the previous was a hit.

You can unroll this loop very easily, so, for example, take 4 keys, do the numbers, fetch the words, and then check them one after the other. One would think this would have more misses in flight at any one time, which it does. But it does not run any faster.

Maybe the loop is too long. Circumstantial evidence suggests that short loops are better for instruction prefetching. So, one can also make a loop that gets any number of words of the Bloom filter and puts them in one local array and the hash numbers in another array. A subsequent loop then reads the hash number, calculates the mask, and checks if there is a hit. In this way one can generate as many misses as one wants and check them as late as one wants. It so happens that doing 8 misses and then checking them is better than either 4 or 16. But 8 is still marginally worse than the loop first mentioned.
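
A sketch of that split-loop variant, using the same building blocks as the macros above; this illustrates the experiment described, not the code actually shipped.

/* Inside the per-batch loop over the key vector: */
uint64 hs[8], ws[8];
int n = last - inx < 8 ? last - inx : 8;

/* Pass 1: hash and fetch the Bloom words, issuing up to 8
   independent cache misses before any result is consumed. */
for (int i = 0; i < n; i++)
  {
    MHASH_STEP_1 (hs[i], REF ((ce_first + sizeof (ELT_T) * (inx + i))));
    ws[i] = bf[BF_WORD (hs[i], sz)];
  }

/* Pass 2: compute the masks and do the predicated check as late as
   possible, giving the misses time to complete. */
for (int i = 0; i < n; i++)
  {
    uint64 mask = BF_MASK (hs[i]);
    matches[mfill] = inx + i;
    mfill += (ws[i] & mask) == mask;
  }
inx += n;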

One can also vary the test. Instead of adding a truth value to the result counter, one can have

if ((word & mask) == mask) result[fill++] = inx;

There is no clear difference between predication (incrementing the fill by truth value) and a conditional jump. The theory of out-of-order execution would predict predication to be better, but the difference is lost in measurement noise. This is true on both Intel Nehalem (Xeon 55xx) and Sandy Bridge (E5 26xx), but could be different on other architectures.
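
For reference, the predicated counterpart of the branchy form above, written with the same variables, is the pattern already used in CHB_CK:

result[fill] = inx;              /* store unconditionally */
fill += (word & mask) == mask;   /* advance the fill only on a hit */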

The multicore scalability of the test will give some information about platform utilization.

This is the selective hash join reduced to its simplest form:

SELECT  COUNT (*) 
  FROM  lineitem, 
        part 
 WHERE  l_partkey = p_partkey 
   AND  p_size < 15
;

The next query is the second simplest hash join but misses the cache much more: since the hash table now has a key and a dependent part, there is an extra pointer to follow, and the hash entry is two words plus the pointer to them in the hash table array (see the layout sketch after the query).

SELECT  SUM (p_retailprice) 
  FROM  lineitem, 
        part 
 WHERE  l_partkey = p_partkey 
   AND  p_size < 15
;
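
As a sketch, the entry layout behind the SUM query could be as follows; the field and type names are illustrative, not the Virtuoso declarations.

#include <stdint.h>

/* Slot in the hash table array: a 64-bit tagged pointer. The high
   16 bits hold part of the hash number; the low 48 bits are the
   address of the entry. */
typedef uint64_t chash_slot_t;

/* The entry pointed to: a single INT key plus one dependent value,
   two words in all. A next link would be added if the key were
   not unique. */
typedef struct int_key_entry_s
{
  int64_t key;        /* p_partkey */
  double  dependent;  /* p_retailprice */
} int_key_entry_t;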

By adjusting the number of parts selected, we can vary the Bloom filter selectivity and the size of the hash table. Below, we show the times for the two queries with single-thread and 24-thread execution, with different percentages of the part table on the build side of the hash join. All runs are against warm 100G TPC-H on the same test system as in the rest of the TPC-H series (dual Xeon E5-2630).

This table compares the performance of the linear and cuckoo implementations on the above queries (COUNT vs. SUM) on either 24 threads or 1 thread. Four data points are given for different sizes of hash table, given as the percentage of the part table on the build side (from 400K entries at 2% to 20M at 100%). The rightmost column, which represents the case where the entire part table is on the build side, does not have a Bloom filter; the other cases do. The Bloom bits are 8/4 for linear and 8/2 for cuckoo. The times are all in milliseconds, and the thousands separator is a comma.

Hash type   Query type   Threads   2% (ms)   10% (ms)   30% (ms)   100% (ms)
Linear      COUNT        24          1,204      1,683      3,100       6,214
Linear      SUM          24          1,261      2,447      5,059      13,086
Linear      COUNT        1          15,286     22,451     38,863      66,722
Linear      SUM          1          17,575     33,664     81,927     179,013
Cuckoo      COUNT        24          1,849      2,840      4,105       6,203
Cuckoo      SUM          24          2,833      4,903      9,446      19,652
Cuckoo      COUNT        1          25,146     39,064     57,383      85,105
Cuckoo      SUM          1          33,647     67,089    121,989     240,941

We clearly see cache effects on the first two lines, where SUM and COUNT run in almost the same time on a small hash table but diverge to a 2x difference at the largest one. The instruction path length is not very different for SUM and COUNT, but the memory footprint has a 3x difference.

We note that the SMP scalability of linear is slightly better, comparing the ratio of single-thread SUM to 24-thread SUM: 179,013 / 13,086 is about 13.7x for linear, versus 240,941 / 19,652, about 12.3x, for cuckoo at 100%. Both numbers are over 12x, indicating net benefit from core multithreading. (The test system has 12 physical cores.) The linear hash systematically outperforms cuckoo, understandably, since it makes a smaller net number of cache misses. The overall effect on the TPC-H score is noticeable, at around 15-20K units of composite score at 100G.

In conclusion, the Virtuoso hash join implementation is certainly on the level, with only small gains to be expected from further vectoring and prefetching. These results may be reproduced using the v7fasttrack Virtuoso Open Source releases from GitHub; develop/7 for cuckoo and feature/analytics for linear hash.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

05/28/2014 17:12 GMT
In Hoc Signo Vinces (part 14 of n): Virtuoso TPC-H Implementation Analysis [ Orri Erling ]

In the previous article we saw an unofficial result of running the full workload. Here we will look more closely at the performance profile.

In this article we look at what the server actually does. The execution profiles for all the queries are available for download. To experiment with parallelism, you may download the software and run it locally. An Amazon image may be provided later.

Execution Profile

Below is the top of the oprofile output for a run of the 22 queries with qualification parameters against the 100G database. The operation in TPC-H terms is given under each heading.

CPU: Intel Sandy Bridge microarchitecture, speed 2299.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1406009   9.5117  ce_vec_int_range_hash
-- Bloom filter before selective hash join where the hash join is the best or only condition on a scan, e.g., lineitem scan where l_partkey is checked against a Bloom filter

730935    4.9448  ce_vec_int_sets_hash
-- Bloom filter check where another condition is applied first, e.g., lineitem scan with a condition on l_shipdate, then Bloom filter check on l_partkey

617091    4.1746  hash_source_chash_input_1i_n
-- Q13 right outer hash join with probe from orders, build from customer; NOT EXISTS test in Q16

586938    3.9706  cs_decode
-- Generic reading of a column, all queries

536273    3.6279  ce_intd_range_ltgt
-- Date range comparison, most queries, e.g., Q1, 3, 4, 5, 6, 7, 8, 20

479898    3.2465  cha_cmp_1i
-- Q13 GROUP BY on c_custkey, Q15 GROUP BY on s_suppkey. Indicates missing the cache on a high cardinality GROUP BY with a single INT grouping key

473721    3.2047  cha_inline_1i_n_int
-- Selective hash join after prefiltering with a Bloom filter. Checks only that the key is in the hash table, no dependent part. For example Q8, Q9, Q17, Q20, with lineitem filtered by part

463723    3.1371  ce_dict_generic_range_filter
-- Range condition on a low cardinality column (dictionary encoded), e.g., l_quantity, l_discount

425149    2.8761  cha_inline_1i_int
-- Hash join check that a single INT key with a dependent part is in a hash table, fetching the dependent part for hits

365040    2.4695  setp_chash_run
-- GROUP BY, e.g., GROUP BY on c_custkey in Q13

359645    2.4330  clrg_partition_dc
-- Partitioning a vector of values; occurs in all high cardinality GROUP BYs, e.g., Q13, Q15

349473    2.3642  gb_aggregate
-- Updating aggregates after the grouping keys are resolved, e.g., Q1

331926    2.2455  ce_dict_any_sets_decode
-- Fetching non-contiguous dictionary encoded strings, e.g., l_returnflag, l_linestatus in Q1

316731    2.1427  cha_insert_1i
-- Building a hash join hash table from a single INT key to a dependent part, e.g., Q14 from l_partkey to lineitems in a time window with a given l_partkey

286865    1.9406  ce_search_rld
-- Index lookup for run length delta compressed keys, e.g., l_orderkey, ps_partkey

231390    1.5654  cha_insert
-- Build of a hash join hash table for multipart keys, e.g., hash join with partsupp (Q9) or lineitem (Q17, Q20) on the build side

224070    1.5158  ce_dict_int64_sets_decode
-- Fetching a non-contiguous set of double column values from dictionary encoding, e.g., l_discount, l_quantity

218506    1.4782  ce_intd_any_sets_decode
-- Fetching a non-contiguous set of date column values, e.g., l_shipdate in Q9

200686    1.3576  page_wait_access
-- Translating page numbers into buffers in the buffer pool for column access

198854    1.3452  itc_col_seg
-- Generic part of table scan/index access

197645    1.3371  cha_insert_1i_n
-- Hash join build for hash tables with a single INT key and no dependent part, e.g., the lineitem-to-part join in Q9, Q17, Q20

195300    1.3212  cha_bloom_unroll_a
-- Bloom filter for a hash-based IN predicate

192309    1.3010  hash_source_chash_input
-- Hash join probe with a multipart key, e.g., Q9 against partsupp

191313    1.2942  strstr_sse42
-- SSE 4.2 substring match, e.g., the NOT LIKE condition in Q13

186325    1.2605  itc_fetch_col_vec
-- Translating page numbers into buffers in the buffer pool for column access

159198    1.0770  ce_vec_int64_sets_decode
-- Fetching non-contiguous 64-bit values from an array-represented column, e.g., l_extendedprice

So what does TPC-H do? It is all selective hash joins. Then it is table scans with a condition on a DATE column; the DATE conditions arise because this implementation does not order the big tables (lineitem, orders) on their DATE columns. Third, TPC-H does big GROUP BYs, with Q13 representing most of this; all other GROUP BYs have at least 10x fewer groups. Then it just extracts column values, most often after selecting rows on some condition; quite often the condition is a foreign key column of the row finding a hit in a hash table. This last pattern is called invisible hash join.

Then there are some index lookups, but there the join is usually a merge pattern, like reading orders in order of o_orderkey and then getting l_orderkey matches from lineitem (e.g., Q21). Then there are quite often partitioning operations, i.e., many threads produce tuples and these tuples go to a consumer thread selected based on a partitioning column on the tuple. This is also called exchange. This is in all the high cardinality GROUP BYs and in the RIGHT OUTER JOIN of Q13.

Whether the implementation is RAM-only or paging from disk makes next to no difference. Virtuoso and Actian Vector both have a buffer pool, with Virtuoso using substantially smaller pages (8K vs. 256K or 512K). The page-number-to-buffer translation and the latching that goes with it is under 3% (itc_fetch_col_vec, page_wait_access). Of course, actually accessing secondary storage will kill the score, but checking that something is in memory is safe as long as this is always a hit.

So, once the query plans are right, the problem resolves into a few different loops. The bulk of the effort in making a TPC-H implementation is in query optimization so that the right loops run in the right order. I will further explain what the loops should contain in the next article.

Many of the functions seen in the profile are instantiations of a template for a specific data type. There are also different templates for variants of a data structure, like a hash table with different keys.

Compilation is a way of generating exactly the right loop for any set of data types. We do not find much need for this here, though. The type-specific operations that are in fact needed are anticipatable and can be predefined based on templates.

Future Gains

There are possible gains in the following domains:

  • Query optimization - Q15, Q17 do some work in duplicate, so reuse will roughly cut the time in half. In addition to reuse, there is a possibility to convert a scalar subquery into a derived table with a GROUP BY (decorrelation). Decorrelation also applies in Q20. Reuse will give some 10-12K more score; decorrelation is probably below measurement noise.

  • Group join - Merging the GROUP BY and the hash probe in Q13 will save 2% of time on throughput, maybe 5K more score.

  • Better parallelism in power test - A good power test takes 42s, and 5 times the work takes 165s; at power-test efficiency, 5x the work would take 5 × 42 = 210s, so the platform utilization of the power test is roughly 165/210, about 4/5. This could be slightly better.

  • NUMA - Up to 20% performance gains have been seen on scale-out configurations from binding a process to one CPU socket. This gain will be readily seen in a scale-out setting when running one process per physical CPU. Gains in the NUMA department are rather fickle and fragile and are worth less in real life than in benchmarks. So I would say that work in single-server NUMA optimization is not immediately needed since scale-out configurations will get these gains as soon as one sets affinities for processes. But whether binding processes to CPUs makes sense depends on how even the workload is. TPC-H is even, but reality is less so.

In principle, a composite score of a good run could go from 250K to 280K by diverse incremental improvements. There is some noise in run times, so two consecutive runs can deviate by a few percent, which is already significant when talking of small improvements.

Larger Scales

For a 100 GB run with 5 throughput streams, the peak query memory consumption is 3.5 GB. This includes all the GROUP BY and hash join tables that are made. Q13 is the most expensive in terms of space and allocates a peak of 1 GB for a single execution. This is trivial in comparison with the working set. But at 100x greater scale (10,000 GB or 10 TB) this becomes about 630 GB (3.5 GB × 100 × 9/5), as there will be a minimum of 9 concurrent streams instead of 5.

Now 10 TB is clearly a scale-out size, so the 630 GB transient memory is not on a single machine.

Still, going to large scale-out will change the nature of some queries and introduce significant data transfer. We will know soon enough.

Problems of the Metric

A small-scale TPC-H, e.g., 100 GB, starts to have features of a lookup workload. This means that there is high variability between consecutive executions, and that the pre-execution state of the system has a large effect on a measurement.

The rules say that the power test follows the bulk load, with perhaps some checking of the correctness of the load in between. The bulk load is basically unregulated and usually will include statistics gathering.

The first power test shows significant variation, anything from 220 K to 240 K, while the second power test is steadily around 255 K. Since the reported score is the lower of the two, the biggest return to the implementor is in making sure the first power test is good.

The second throughput test is usually 5-10 K higher than the first; the throughput test is less sensitive. The difference does not come from I/O, but from system time for memory allocation. The memory, in the same quantities and block sizes, is reused by the second run.

A good power run is 42s from 100% warm RAM. A power run that gets the data from the OS disk buffers is 70s or so. A power run that gets data from SSD is worse, maybe 120s.

To cite an example, increasing the buffer pool size from 64 GB to 72 GB gets the first post-load power test from 120-150K to 230-240K while having no appreciable effect on subsequent tests. The effect is exacerbated by the fact that the power score is based on a geometric mean of run times. Very short queries (e.g., Q2) vary between consecutive in-memory executions from 120ms to 220ms. A similar variation occurs in Q13, which lands on either side of 6s. Due to the geometric mean, the same variability has very different impact depending on which query it hits. A Q2 that reads data from out of process can take 2s instead of the expected under-200ms. This kills the score even though a delay of 1.8s as such would not. So increasing the buffer pool in the example just serves to make sure the small supplier table is in memory. Fetching it from the OS is simply not an option in the first Q2, even if it were an option in a longer-running query. Remember, the lower of the two scores is reported, and the first power test will be bad unless it is somehow primed by some trick like bulk load order.

When differences between implementations are small, variation between consecutive runs becomes important. This is why OLTP benchmarks are required to run for a relatively long time and only measure the steady state portion. This would also be appropriate for small TPC-H runs.

Conclusions

Query performance reduces to a few loops when all the conditions are right. Getting the conditions right depends on query optimization and the right architecture choices. These results would be unthinkable without vectored execution and a compressed column store design. These in and of themselves guarantee nothing unless all the plans are right. A previous version of Virtuoso 7 takes over 10x longer because of bad plans and because it misses some key execution techniques, like the RIGHT OUTER JOIN in Q13 and partitioned GROUP BY.

Last summer, we did runs of the much simpler Star Schema Benchmark. The results were very good, because the engine had the much smaller set of tricks and the right plans. Repeating these tests now would show some gains from a still better hash join but nothing dramatic.

In the next article we will look at the finer points of hash join. After this we move to larger scales and clusters.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

05/01/2014 13:07 GMT Modified: 05/28/2014 17:13 GMT
In Hoc Signo Vinces (part 13 of n): Virtuoso TPC-H Kit Now on V7 Fast Track [ Orri Erling ]

This article shows how you can reproduce the TPC-H experiments from the previous posts in this series.

All the code is in the feature/analytics branch of the v7fasttrack git repository on GitHub.

Prerequisites

Start by checking out and compiling Virtuoso Open Source (VOS).

git clone https://github.com/v7fasttrack/virtuoso-opensource
cd virtuoso-opensource 
git checkout feature/analytics
./autogen.sh
export CFLAGS="-msse4.2 -DSSE42"
./configure 
make -j 24
make install

The system should be an x86_64, Intel Core i7 or later, with SSE 4.2 support. (Running without SSE 4.2 is possible, but for a better score you need to define it, as above, before running configure.) Any gcc that supports SSE 4.2 will do.

To have a good result, the system should have at least 96 GB of RAM and SSD.

To get a good load time, both the database files and the CSV files made by dbgen should be on SSD.

Running TPC-H

Set up

Copy the binsrc/tests/tpc-h directory from the check-out to an SSD, if it is not already on one. Set $HOME to point to the root directory of the check-out. Rename the virtuoso.ini-100G in the tpc-h directory to virtuoso.ini. Edit the database file paths in the virtuoso.ini. Where it says --

Segment1 = 1024, /1s1/dbs/tpch100cp-1.db = q1, /1s2/dbs/tpch100cp-2.db = q2

-- change the file names, and add or remove files as appropriate, each file with a different = qn, until you have one file per independent device. If this is a RAID, one file per distinct device in the RAID usually brings improvement. Edit the TransactionFile entry, and replace /1s2/dbs/ with a suitable path.

Edit ThreadsPerQuery to be the number of threads on the machine. For i7, this is double the number of cores; your environment may vary. AsyncQueueMaxThreads should be set to double ThreadsPerQuery.

The memory settings (NumberOfBuffers and MaxDirtyBuffers) are OK for 100G scale. For larger scale, make the memory settings correspondingly larger, not to exceed 75% of system memory. Count 8.5 KB per buffer. If you have less memory, you can decrease these. If so, the first power test will be hit the worst, so the scores will not be as good.
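
For illustration, a 64 GB buffer pool on a 24-thread machine might be configured as below; the numbers are examples derived from the 8.5 KB-per-buffer rule above, not tuned values.

[Parameters]
; 64 GB at roughly 8.5 KB per buffer is about 7,500,000 buffers
NumberOfBuffers       = 7500000
; dirty buffers capped below the total, e.g., around 3/4 of it
MaxDirtyBuffers       = 5500000
; one thread per hardware thread of the machine
ThreadsPerQuery       = 24
; double ThreadsPerQuery, per the text above
AsyncQueueMaxThreads  = 48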

The default BIOS settings are usually OK. Disabling prefetch of adjacent cache line does not help, and turning off core threads does not help either.

For 100G, the data files for loading will take 100 GB, and the database files will take 88 GB divided among however many files. Be sure there is enough space before you start.

Generate the data

In the tpc-h directory (copy of binsrc/tests/tpc-h) --

./gen.sh 100 5 2

The first parameter is the scale factor; the second is the number of streams to use in the throughput test; the last is the number of consecutive test runs. The minimum number of streams is 5 for 100G; each successive scale adds one more. A larger number of streams is allowed, but will not make a better result in this case. A test always consists of 2 runs; you could specify more, but the extra tests will not influence the score.

Making the data files takes the longest time. You may run dbgen multithreaded to make the dataset in parallel, but then the load scripts will have to be changed to match.

Run the load

Start Virtuoso.

./load.sh 100

Looking at iostat, you will see a read rate of about 140 MB/s from the source files.

Run the test

./run.sh 100 5 2 

The parameters have the same meaning as in gen.sh, and the same values must be specified.

Outputs

The run produces two main files, report1.txt and report2.txt. These are the numerical quantity summaries for the first and second run. Additionally, there are output files for each query stream. The suppfiles.sh script can be used to collect the supporting files archive for a TPC-H full disclosure report.

The database is left running. To reuse the loaded data for another experiment, kill the Virtuoso process, delete the transaction log, and restart. This will have the data in the post-load state. To get warm cache, use the warm.sql script in the tpc-h directory.
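
For example, the reset could look like the following; the log path comes from your TransactionFile setting, and the exact invocation will vary with your setup.

kill $(pidof virtuoso-t)                # stop the server
rm /1s2/dbs/virtuoso.trx                # delete the transaction log (TransactionFile)
virtuoso-t +configfile virtuoso.ini &   # restart in the post-load state
isql 1111 dba dba warm.sql              # warm the cache with the provided script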

On the test system used in this series, 12 core E5 at 2.3GHz, we expect 240K for the first run and 250K for the second. With a top-of-the-line E5, we expect around 400K. For an 8-core 2.26GHz Nehalem, we expect 150K.

If you get scores that are significantly different, something is broken; we would like to know about this.

If you have this audited according to the TPC rules, you will be allowed to call this a TPC-H result. Without such audit, the result should be labeled Virt-H.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/28/2014 12:10 GMT Modified: 05/28/2014 17:13 GMT
In Hoc Signo Vinces (part 13 of n): Virtuoso TPC-H Kit Now on V7 Fast Track [ Orri Erling ]

This article shows how you can reproduce the TPC-H experiments from the previous posts in this series.

All the code is in the feature/analytics branch of the v7fasttrack git repository on GitHub.

Prerequisites

Start by checking out and compiling Virtuoso Open Source (VOS).

git clone https://github.com/v7fasttrack/virtuoso-opensource
cd virtuoso-opensource 
git checkout feature/analytics
./autogen.sh
export CFLAGS="-msse4.2 -DSSE42"
./configure 
make -j 24
make install

The system should be an x86_64, Intel Core i7 or later, with SSE 4.2 support. (Running without SSE 4.2 is possible, but for better score you need to define it before doing configure.) The gcc may be any that supports SSE 4.2.

To have a good result, the system should have at least 96 GB of RAM and SSD.

To get a good load time, both the database files and the CSV files made by dbgen should be on SSD.

Running TPC-H

Set up

Copy the binsrc/tests/tpc-h directory from the check-out to an SSD, if it is not already on one. Set $HOME to point to the root directory of the check-out. Rename the virtuoso.ini-100G in the tpc-h directory to virtuoso.ini. Edit the database file paths in the virtuoso.ini. Where it says --

Segment1 = 1024, /1s1/dbs/tpch100cp-1.db = q1, /1s2/dbs/tpch100cp-2.db = q2

-- change the file names, and add or remove files as appropriate, each file with a different = qn, until you have one file per independent device. If this is a RAID, one file per distinct device in the RAID usually brings improvement. Edit the TransactionFile entry, and replace /1s2/dbs/ with a suitable path.

Edit ThreadsPerQuery to be the number of threads on the machine. For i7, this is double the number of cores; your environment may vary. AsyncQueueMaxThreads should be set to double ThreadsPerQuery.

The memory settings (NumberOfBuffers and MaxDirtyBuffers) are OK for 100G scale. For larger scale, make the memory settings correspondingly larger, not to exceed 75% of system memory. Count 8.5 KB per buffer. If you have less memory, you can decrease these. If so, the first power test will be hit the worst, so the scores will not be as good.

The default BIOS settings are usually OK. Disabling prefetch of adjacent cache line does not help, and turning off core threads does not help either.

For 100G, the data files for loading will take 100 GB, and the database files will take 88 GB divided among however many files. Be sure there is enough space before you start.

Generate the data

In the tpc-h directory (copy of binsrc/tests/tpc-h) --

./gen.sh 100 5 2

The first parameter is the scale factor; the second is the number of streams to use in the throughput test; the last is the number of consecutive test runs. The minimum number of streams is 5 for 100G; each successive scale adds one more. A larger number of streams is allowed, but will not make a better result in this case. A test always consists of 2 runs; you could specify more, but the extra tests will not influence the score.

Making the data files takes the longest time. You may run dbgen multithreaded to make the dataset in parallel, but then the load scripts will have to be changed to match.

Run the load

Start Virtuoso.

./load.sh 100

Looking at iostat, you will see a read rate of about 140 MB/s from the source files.

Run the test

./run.sh 100 5 2 

The parameters have the same meaning as in gen.sh, and the same values must be specified.

Outputs

The run produces two main files, report1.txt and report2.txt. These are the numerical quantity summaries for the first and second run. Additionally, there are output files for each query stream. The suppfiles.sh script can be used to collect the supporting files archive for a TPC-H full disclosure report.

The database is left running. To reuse the loaded data for another experiment, kill the Virtuoso process, delete the transaction log, and restart. This will have the data in the post-load state. To get warm cache, use the warm.sql script in the tpc-h directory.

On the test system used in this series, 12 core E5 at 2.3GHz, we expect 240K for the first run and 250K for the second. With a top-of-the-line E5, we expect around 400K. For an 8-core 2.26GHz Nehalem, we expect 150K.

If you get scores that are significantly different, something is broken; we would like to know about this.

If you have this audited according to the TPC rules, you will be allowed to call this a TPC-H result. Without such audit, the result should be labeled Virt-H.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/28/2014 12:09 GMT Modified: 05/28/2014 17:12 GMT
In Hoc Signo Vinces (part 12 of n): TPC-H: Result Preview [ Virtuoso Data Space Bot ]

In this article, we look at the 100 GB single-server results for the whole workload. We will call this Virt-H instead of TPC-H in order to comply with the TPC rules: Use of the TPC-H label requires an audit.

The test consists of a bulk load followed by two runs. Each run consists of a single user power test and a multi-user throughput test. The number of users in the throughput test is up to the test sponsor but must be at least 5 for the 100 GB scale. The reported score is the lower of the two scores.

Result Summary

Scale Factor 100 GB
dbgen version 2.15
Load time 0:15:02
Composite qph 241,482.3
System Availability Date 2014-04-22

The price/performance is left open. The hardware costs about 5,000 euros, and the software is open source, so the cost per performance would be a minimum of about 0.02 euros per qph at 100G (5,000 / 241,482 ≈ 0.02). This is not compliant with the TPC pricing rules, though, which require 3-year maintenance contracts for all parts.

The software configuration did not use RAID; otherwise, the software setup would be auditable to the best of my knowledge. To satisfy the TPC pricing rule, the hardware would have to be the same models from Dell, HP, or another large brand.

Executive Summaries of Each Run

Run 1

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 239,785.1
Virt-H Throughput 243,191.4
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 241,482.3
Measurement Interval in Throughput Test (Ts) 162.935000 seconds
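
For reference, the composite figure is the geometric mean of the power and throughput numbers: sqrt(239,785.1 * 243,191.4) ≈ 241,482.3, which matches the metric reported above; the same relation holds for Run 2 below.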

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:17:46 2014-04-21 21:18:33 0:00:47
Stream 1 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 2 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 3 2014-04-21 21:18:33 2014-04-21 21:21:06 0:02:33
Stream 4 2014-04-21 21:18:33 2014-04-21 21:21:10 0:02:37
Stream 5 2014-04-21 21:18:33 2014-04-21 21:21:16 0:02:43
Refresh 0 2014-04-21 21:17:46 2014-04-21 21:17:49 0:00:03
  2014-04-21 21:17:50 2014-04-21 21:17:51 0:00:01
Refresh 1 2014-04-21 21:19:25 2014-04-21 21:19:38 0:00:13
Refresh 2 2014-04-21 21:18:33 2014-04-21 21:18:48 0:00:15
Refresh 3 2014-04-21 21:18:49 2014-04-21 21:19:01 0:00:12
Refresh 4 2014-04-21 21:19:01 2014-04-21 21:19:13 0:00:12
Refresh 5 2014-04-21 21:19:13 2014-04-21 21:19:25 0:00:12

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.311882 0.383459 1.143286 0.439926 1.594027 0.736482 1.440826 1.198925
Stream 1 5.192341 0.952574 6.184940 1.194804 6.998207 5.122059 5.962717 6.773401
Stream 2 7.354001 1.191604 4.238262 1.770639 5.782669 1.357578 4.034697 6.354747
Stream 3 6.489788 1.585291 4.645022 3.358926 7.904636 3.220767 5.694622 7.431067
Stream 4 5.609555 1.066582 6.740518 2.503038 9.439980 3.424101 4.404849 4.256317
Stream 5 10.346825 1.787459 4.391000 3.151059 4.974037 2.932079 6.191782 3.619255
Min Qi 5.192341 0.952574 4.238262 1.194804 4.974037 1.357578 4.034697 3.619255
Max Qi 10.346825 1.787459 6.740518 3.358926 9.439980 5.122059 6.191782 7.431067
Avg Qi 6.998502 1.316702 5.239948 2.395693 7.019906 3.211317 5.257733 5.686957
 
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.476940 2.004782 2.070967 1.015134 7.995799 2.142581 1.989357 1.581758
Stream 1 11.351299 6.657059 7.719765 5.157236 25.156379 8.566067 7.028898 8.146883
Stream 2 13.954105 8.341359 10.265949 3.289724 25.249435 6.370577 11.262650 7.684574
Stream 3 13.597277 5.783821 5.944240 5.214661 24.253991 8.742896 7.701709 5.801641
Stream 4 15.612070 6.126494 4.533748 5.733828 23.021583 6.423207 8.358223 6.866477
Stream 5 8.421209 9.040726 7.799425 3.908758 23.342975 9.934672 11.455598 8.258504
Min Qi 8.421209 5.783821 4.533748 3.289724 23.021583 6.370577 7.028898 5.801641
Max Qi 15.612070 9.040726 10.265949 5.733828 25.249435 9.934672 11.455598 8.258504
Avg Qi 12.587192 7.189892 7.252625 4.660841 24.204873 8.007484 9.161416 7.351616
 
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.258070 0.981896 1.161602 1.933124 2.203497 1.042949 3.349407 1.296630
Stream 1 8.213340 4.070175 5.662723 12.260503 7.792825 3.323136 9.296430 3.939927
Stream 2 16.754827 3.895688 4.413773 7.529466 6.288539 2.717479 11.222082 4.135510
Stream 3 8.486809 2.615640 7.426936 7.274289 6.706145 3.402654 8.278881 4.260483
Stream 4 12.604905 7.735042 5.627039 6.343302 7.242370 3.492640 6.503095 3.698821
Stream 5 8.221733 2.670036 5.866626 13.108081 9.428098 4.282014 8.213320 4.088321
Min Qi 8.213340 2.615640 4.413773 6.343302 6.288539 2.717479 6.503095 3.698821
Max Qi 16.754827 7.735042 7.426936 13.108081 9.428098 4.282014 11.222082 4.260483
Avg Qi 10.856323 4.197316 5.799419 9.303128 7.491595 3.443585 8.702762 4.024612

Run 2

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 257,944.7
Virt-H Throughput 240,998.0
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 249,327.4
Measurement Interval in Throughput Test (Ts) 164.417000 seconds

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:21:20 2014-04-21 21:22:01 0:00:41
Stream 1 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 2 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 3 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 4 2014-04-21 21:22:02 2014-04-21 21:24:44 0:02:42
Stream 5 2014-04-21 21:22:02 2014-04-21 21:24:46 0:02:44
Refresh 0 2014-04-21 21:21:20 2014-04-21 21:21:22 0:00:02
  2014-04-21 21:21:22 2014-04-21 21:21:23 0:00:01
Refresh 1 2014-04-21 21:22:49 2014-04-21 21:23:04 0:00:15
Refresh 2 2014-04-21 21:22:01 2014-04-21 21:22:14 0:00:13
Refresh 3 2014-04-21 21:22:14 2014-04-21 21:22:27 0:00:13
Refresh 4 2014-04-21 21:22:26 2014-04-21 21:22:39 0:00:13
Refresh 5 2014-04-21 21:22:39 2014-04-21 21:22:49 0:00:10

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.437262 0.227516 1.172620 0.541201 1.542084 0.743255 1.459368 1.183166
Stream 1 5.205225 0.499833 4.854558 4.818087 5.920773 3.347414 5.446411 3.723247
Stream 2 5.833803 0.659051 6.023266 3.123523 4.358200 3.371315 6.772453 4.978415
Stream 3 6.308935 0.662744 7.573807 5.000859 5.282467 4.391930 5.280472 7.852718
Stream 4 5.791856 0.421592 5.953592 4.688037 9.949038 3.098282 4.153124 4.824209
Stream 5 13.537098 1.760386 3.308982 2.299178 4.882695 2.652497 5.383128 10.178447
Min Qi 5.205225 0.421592 3.308982 2.299178 4.358200 2.652497 4.153124 3.723247
Max Qi 13.537098 1.760386 7.573807 5.000859 9.949038 4.391930 6.772453 10.178447
Avg Qi 7.335383 0.800721 5.542841 3.985937 6.078635 3.372288 5.407118 6.311407
 
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.441940 1.948770 2.154384 1.148494 6.014453 1.647725 1.437587 1.585284
Stream 1 14.127674 7.824844 7.100679 3.586457 28.216115 7.587547 9.859152 5.829869
Stream 2 16.102880 7.676986 5.887327 2.796729 24.847035 7.146757 11.408922 7.641239
Stream 3 15.678701 5.786427 9.221883 2.692321 28.434916 6.657457 8.219745 7.706585
Stream 4 11.985421 10.182807 5.667618 6.875264 27.547492 7.438075 9.065924 8.895070
Stream 5 6.913707 7.662703 8.657333 3.282895 24.126612 10.963691 12.138564 7.962654
Min Qi 6.913707 5.786427 5.667618 2.692321 24.126612 6.657457 8.219745 5.829869
Max Qi 16.102880 10.182807 9.221883 6.875264 28.434916 10.963691 12.138564 8.895070
Avg Qi 12.961677 7.826753 7.306968 3.846733 26.634434 7.958705 10.138461 7.607083
 
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.275267 1.139390 1.165591 2.073658 2.261869 0.703055 2.327755 1.146501
Stream 1 13.720792 4.428528 3.651645 9.841610 6.710473 2.595879 9.783844 3.800103
Stream 2 12.532257 2.312755 6.182661 8.666967 9.383983 1.414853 7.570509 4.539598
Stream 3 7.578779 3.342352 8.155356 4.925493 6.590047 2.612912 8.497542 4.638512
Stream 4 10.967178 2.173935 6.382803 5.082562 8.744671 3.074768 7.577794 4.435140
Stream 5 9.438581 2.551124 8.375607 8.339441 8.201650 1.982935 7.334306 3.404017
Min Qi 7.578779 2.173935 3.651645 4.925493 6.590047 1.414853 7.334306 3.404017
Max Qi 13.720792 4.428528 8.375607 9.841610 9.383983 3.074768 9.783844 4.638512
Avg Qi 10.847517 2.961739 6.549614 7.371215 7.926165 2.336269 8.152799 4.163474

Details of System Under Test (SUT)

Hardware

Chassis Supermicro 2U
Motherboard Supermicro X9DR3-LN4F+
CPU 2 x Intel Xeon E5-2630 @ 2.3 GHz
(6 cores, 12 threads each;
total 12 cores, 24 threads)
RAM 192 GB DDR3 (24 x 8 GB, 1066 MHz)
Storage 2 x Crucial 512 GB SSD
 

Software

DBMS Virtuoso Open Source 7.11.3209
(feature/analytics on v7fasttrack on GitHub)
OS CentOS 6.2

Conclusions

This experiment places Virtuoso in the same ballpark as Actian Vector (formerly branded Vectorwise), which has dominated the TPC-H scoreboard in recent years. The published Vector results are on more cores and/or faster clocks; one would have to run on the exact same platform to make precise comparisons.

Virtuoso ups the ante by providing this level of performance in open source. For a comparison with EXASolution and Actian Matrix (formerly ParAccel), we will have to go to the Virtuoso scale-out configuration, to follow shortly.

The next articles will provide a detailed analysis of performance and instructions for reproducing the results. The run outputs and scripts are available for download.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/22/2014 11:33 GMT Modified: 05/28/2014 17:13 GMT
In Hoc Signo Vinces (part 11 of n): TPC-H Q2, Q10 - Late Projection [ Virtuoso Data Space Bot ]

Analytics is generally about making something small out of something large. This reduction is obtained by a TOP k operator (i.e., show only the 10 best by some metric) and/or by grouping and aggregation (i.e., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).

In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned about attributes on which there are no conditions, except for fetching them at the last moment, only for the entities which in fact will be returned to the user.

We look at TPC-H Q2 and Q10.

Q2:

  SELECT  TOP 100
                   s_acctbal,
                   s_name,
                   n_name,
                   p_partkey,
                   p_mfgr,
                   s_address,
                   s_phone,
                   s_comment
    FROM  part,
          supplier,
          partsupp,
          nation,
          region
   WHERE  p_partkey = ps_partkey
     AND  s_suppkey = ps_suppkey
     AND  p_size = 15
     AND  p_type LIKE '%BRASS'
     AND  s_nationkey = n_nationkey
     AND  n_regionkey = r_regionkey
     AND  r_name = 'EUROPE'
     AND  ps_supplycost = 
            ( SELECT  MIN(ps_supplycost)
                FROM  partsupp,
                      supplier,
                      nation,
                      region
               WHERE  p_partkey = ps_partkey
                 AND  s_suppkey = ps_suppkey
                 AND  s_nationkey = n_nationkey
                 AND  n_regionkey = r_regionkey
                 AND  r_name = 'EUROPE'
            )
ORDER BY  s_acctbal DESC,
          n_name,
          s_name,
          p_partkey

The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.

Q10:

  SELECT  TOP 20
                                                     c_custkey,
                                                     c_name,
          SUM(l_extendedprice * (1 - l_discount)) AS revenue,
                                                     c_acctbal, 
                                                     n_name,
                                                     c_address,
                                                     c_phone,
                                                     c_comment
    FROM  customer,
          orders,
          lineitem,
          nation
   WHERE  c_custkey = o_custkey
     AND  l_orderkey = o_orderkey
     AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
     AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
     AND  l_returnflag = 'R'
     AND  c_nationkey = n_nationkey
GROUP BY  c_custkey,
          c_name,
          c_acctbal,
          c_phone,
          n_name,
          c_address,
          c_comment
ORDER BY  revenue DESC

The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.

We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, and in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article does apply to Q2 but not to Q10.

The profile for Q2 follows:

{ 
time   6.1e-05% fanout         1 input         1 rows
time       1.1% fanout         1 input         1 rows
{ hash filler
Subquery 27 
{ 
time    0.0012% fanout         1 input         1 rows
REGION         1 rows(t10.R_REGIONKEY)
 R_NAME = <c EUROPE>
time   0.00045% fanout         5 input         1 rows
NATION         5 rows(t9.N_NATIONKEY)
 N_REGIONKEY = t10.R_REGIONKEY
time       1.6% fanout     40107 input         5 rows
SUPPLIER   4.2e+04 rows(t8.S_SUPPKEY)
 S_NATIONKEY = t9.N_NATIONKEY
 
After code:
      0: t8.S_SUPPKEY :=  := artm t8.S_SUPPKEY
      4: BReturn 0
time       0.1% fanout         0 input    200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time    0.0004% fanout         1 input         1 rows
{ fork
time        21% fanout     79591 input         1 rows
PART     8e+04 rows(.P_PARTKEY)
 P_TYPE LIKE <c %BRASS> LIKE <c > ,  P_SIZE =  15 
time        44% fanout  0.591889 input     79591 rows
 
Precode:
      0: { 
time     0.083% fanout         1 input     79591 rows
time      0.13% fanout         1 input     79591 rows
{ fork
time        24% fanout  0.801912 input     79591 rows
PARTSUPP       3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
 inlined  PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card       0.2 -> ()
time       1.3% fanout         0 input     63825 rows
Hash source 49 merged into ts not partitionable       0.2 rows(.PS_SUPPKEY) -> ()
 
After code:
      0:  min min.PS_SUPPLYCOSTset no set_ctr
      5: BReturn 0
}
 
After code:
      0: aggregate :=  := artm min
      4: BReturn 0
time      0.19% fanout         0 input     79591 rows
Subquery Select(aggregate)
}
 
      8: BReturn 0
PARTSUPP     5e-08 rows(.PS_SUPPKEY)
 inlined  PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time       5.9% fanout  0.247023 input     47109 rows
SUPPLIER unq       0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
 inlined  S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time     0.077% fanout         1 input     11637 rows
NATION unq         1 rows (.N_REGIONKEY, .N_NAME)
 inlined  N_NATIONKEY = .S_NATIONKEY
time     0.051% fanout         1 input     11637 rows
REGION unq       0.2 rows ()
 inlined  R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time      0.42% fanout         0 input     11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
 
}
time    0.0016% fanout       100 input         1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time      0.02% fanout         1 input       100 rows
PART unq      0.95 rows (.P_MFGR)
 inlined  P_PARTKEY = .P_PARTKEY
time     0.054% fanout         1 input       100 rows
SUPPLIER unq         1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
 inlined  S_SUPPKEY = k_.S_SUPPKEY
time   6.7e-05% fanout         0 input       100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}


 128 msec 1007% cpu,    196992 rnd 2.53367e+07 seq   50.4135% same seg   45.3574% same pg 

The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have been previously put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum price partsupp has already been seen. The gain would not be very large, though.

Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to make a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain. It would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal, hence as soon as there are 100 suppliers found, one can add a restriction on the s_acctbal for subsequent ones.

At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup for getting only the columns that are returned. We note that this is done on 100 rows whereas the previous operations are done on tens-of-thousands of rows. The TOP k restriction produces some benefit, but it is relatively late in the plan, and not many operations follow it.

The plan is easily good enough, with only small space for improvement. Q2 is one of the fastest queries of the set.
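
The engine arrives at this shape internally, but the same late projection can be written out by hand for illustration: compute the TOP 100 keys and ordering columns first, then join back to part and supplier only for the wide columns. A sketch of such an equivalent rewrite (not Virtuoso's internal form):

  SELECT  k.s_acctbal,
          k.s_name,
          k.n_name,
          k.p_partkey,
          p.p_mfgr,
          s.s_address,
          s.s_phone,
          s.s_comment
    FROM  ( SELECT  TOP 100
                    s_acctbal, s_name, n_name, p_partkey, s_suppkey
              FROM  part, supplier, partsupp, nation, region
             WHERE  p_partkey = ps_partkey
               AND  s_suppkey = ps_suppkey
               AND  p_size = 15
               AND  p_type LIKE '%BRASS'
               AND  s_nationkey = n_nationkey
               AND  n_regionkey = r_regionkey
               AND  r_name = 'EUROPE'
               AND  ps_supplycost =
                      ( SELECT  MIN(ps_supplycost)
                          FROM  partsupp, supplier, nation, region
                         WHERE  p_partkey = ps_partkey
                           AND  s_suppkey = ps_suppkey
                           AND  s_nationkey = n_nationkey
                           AND  n_regionkey = r_regionkey
                           AND  r_name = 'EUROPE'
                      )
          ORDER BY  s_acctbal DESC, n_name, s_name, p_partkey
          ) k,
          part p,
          supplier s
   WHERE  p.p_partkey = k.p_partkey
     AND  s.s_suppkey = k.s_suppkey
ORDER BY  k.s_acctbal DESC, k.n_name, k.s_name, k.p_partkey

The inner query touches only the columns needed for the conditions and the ordering; p_mfgr, s_address, s_phone, and s_comment are fetched for just the 100 surviving rows.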

Let us now consider the execution of Q10:

{ 
time   1.1e-06% fanout         1 input         1 rows
time   4.4e-05% fanout         1 input         1 rows
{ hash filler
time   1.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, .N_NAME)
 
time   6.7e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
 
}
time   1.5e-06% fanout         1 input         1 rows
{ fork
time   2.4e-06% fanout         1 input         1 rows
{ fork
time        13% fanout 5.73038e+06 input         1 rows
ORDERS   5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
 O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time       4.8% fanout   2.00042 input 5.73038e+06 rows
LINEITEM       1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
 inlined  L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time        25% fanout         1 input 1.14632e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
CUSTOMER unq         1 rows (.C_NATIONKEY, .C_CUSTKEY)
 inlined  C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card         1 -> (.N_NAME)
time    0.0023% fanout         1 input 1.14632e+07 rows
Hash source 35 merged into ts          1 rows(.C_NATIONKEY) -> (.N_NAME)
time       2.3% fanout         1 input 1.14632e+07 rows
Stage 2
time       3.6% fanout         0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
 
}
time       0.6% fanout 3.88422e+06 input         1 rows
group by read node  
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time      0.57% fanout         0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
 
}
time   6.9e-06% fanout        20 input         1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time   0.00036% fanout         1 input        20 rows
CUSTOMER unq         1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
 inlined  C_CUSTKEY = .C_CUSTKEY
time   1.1e-06% fanout         0 input        20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}


 2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq   76.3221% same seg   21.1204% same pg 

The plan is by index, except for the lookup of nation name for the customer. The most selective condition is on order date, followed by the returnflag on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit. See the input cardinality above the first customer entry in the plan -- over 10M. The key point here is that only the c_custkey and c_nationkey get fetched, which saves a lot of time. In fact the c_custkey is needless since this is anyway equal to the o_custkey, but this makes little difference.

One could argue that customer should be between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between if the former are partitioned on the same key.

The c_nationkey is next translated into an n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many distinct c_custkey values (15M at 100G scale).

The most important trick is fetching all the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is only executed on 20 rows.

Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.

There is another approach to the challenge of this query: If foreign keys are declared and enforced, the system will know that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys, thus the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but does save the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).
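
Written out by hand, the late-projected form of Q10 would group only on c_custkey (plus n_name, which is functionally dependent on it), take the TOP 20, and only then visit customer for the wide columns. A sketch of such an equivalent rewrite, for illustration only:

  SELECT  k.c_custkey,
          c.c_name,
          k.revenue,
          c.c_acctbal,
          k.n_name,
          c.c_address,
          c.c_phone,
          c.c_comment
    FROM  ( SELECT  TOP 20
                    c_custkey,
                    SUM(l_extendedprice * (1 - l_discount)) AS revenue,
                    n_name
              FROM  customer, orders, lineitem, nation
             WHERE  c_custkey = o_custkey
               AND  l_orderkey = o_orderkey
               AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
               AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
               AND  l_returnflag = 'R'
               AND  c_nationkey = n_nationkey
          GROUP BY  c_custkey, n_name
          ORDER BY  revenue DESC
          ) k,
          customer c
   WHERE  c.c_custkey = k.c_custkey
ORDER BY  k.revenue DESC

Here the final join to customer runs on 20 rows, which is exactly what the last access to customer in the profile above does.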

The late projection trick is generally useful, since almost all applications aside from bulk data export have some sort of limit on result set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the others. A row store can also benefit from this in the form of decreased intermediate result size. This is especially good when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be implemented via a special text index and not a scan.

*           *           *           *           *

In the next installment we will have a look at the overall 100G single server performance. After this we will recap the tricks so far. Then it will be time to look at implications of scale out for performance and run at larger scales. After the relational ground has been covered, we can look at implications of schema-lastness, i.e., triples for this type of workload.

So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/07/2014 12:30 GMT Modified: 05/28/2014 17:13 GMT