This piece is dedicated to Peter Boncz, architect of Actian Vector and MonetDB.

Query optimization is hard. It is a set of mutually interacting tricks and special cases. Execution is also hard, but there the tricks do not interact quite as much or as unpredictably. So, if there are a few percent of score to be had from optimizing either execution or query plans, I will take execution first. It is less likely to break things and will probably benefit a larger set of use cases.

As we saw from the profile in the previous article, hash join is the main piece of execution in TPC-H. So, between the article on late projection and the first result preview, I changed the hash table used in HASH JOIN and GROUP BY from cuckoo to linear.

Let's see how the hash tables work: Cuckoo hash is a scheme where an entry can be in one of two possible places in the table. If a new entry is inserted and either of the possible places is unoccupied, it goes there. If both are occupied, it could be that one contains an entry whose other possible location is free -- and then that entry may be relocated. Thus an insert may push the previous occupant of the place somewhere else, which in turn may push another, and so on. It may happen that an insert is still not possible, in which case the entry to insert goes into an exceptions list.
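As a rough illustration of the push-out, consider the following sketch of a cuckoo insert. This is not the Virtuoso code: MHASH, exception_list_add, the 0-means-empty convention, and the relocation budget MAX_KICKS are stand-ins used only to show the shape of the algorithm.

/* Illustrative cuckoo insert with relocation, not the actual Virtuoso code.
   The two candidate places come from different fields of one hash number;
   a slot value of 0 means empty.  MHASH and exception_list_add are stand-ins. */
#define MAX_KICKS 32

static void
places_of (uint64 key, uint64 n_slots, uint64 *p1, uint64 *p2)
{
  uint64 h = MHASH (key);
  *p1 = (h >> 8) % n_slots;
  *p2 = (h >> 32) % n_slots;
}

int
cuckoo_insert (uint64 *tab, uint64 n_slots, uint64 key)
{
  uint64 p1, p2, pos, evicted;
  int kick;
  places_of (key, n_slots, &p1, &p2);
  if (0 == tab[p1]) { tab[p1] = key; return 1; }   /* first place free */
  if (0 == tab[p2]) { tab[p2] = key; return 1; }   /* second place free */
  pos = p1;                                        /* both occupied: relocate */
  for (kick = 0; kick < MAX_KICKS; kick++)
    {
      evicted = tab[pos];                          /* push out the occupant */
      tab[pos] = key;
      key = evicted;                               /* the occupant must now move */
      places_of (key, n_slots, &p1, &p2);
      pos = (pos == p1) ? p2 : p1;                 /* to its other possible place */
      if (0 == tab[pos]) { tab[pos] = key; return 1; }
    }
  exception_list_add (key);                        /* insert still not possible */
  return 0;
}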

To look up an entry, you get a hash number and use different fields of it to pick the two places. Look in one, then the other, then the exceptions. If there is no match and the table is reasonably close to capacity, you will have looked in at least 3 widely separated places to determine the absence of a match. In practice, the hash table consists of a prime number of distinct arrays of a fixed size (partitions), and each partition has its own exception list. A modulo of the hash number picks the array, then two further modulos of different parts of the number pick the places in the array.
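A lookup along these lines might look like the sketch below. Again, this is illustrative rather than the actual code: the partition struct, the shift amounts that extract fields of the hash number, and excl_find are assumptions.

/* Illustrative cuckoo lookup over partitioned arrays, not the actual code.
   One modulo of the hash number picks the partition; two modulos of other
   fields pick the two places; the per-partition exception list is the third
   and last place to look. */
typedef struct cuckoo_part_s
{
  uint64  *slots;        /* fixed-size array of entries */
  uint64   n_slots;
  void    *exceptions;   /* per-partition exception list */
} cuckoo_part_t;

int
cuckoo_get (cuckoo_part_t *parts, uint64 n_parts, uint64 key)
{
  uint64 h = MHASH (key);                      /* hash number */
  cuckoo_part_t *p = &parts[h % n_parts];      /* pick the partition */
  uint64 place1 = (h >> 8) % p->n_slots;       /* two places from different */
  uint64 place2 = (h >> 32) % p->n_slots;      /* fields of the hash number */
  if (key == p->slots[place1])
    return 1;                                  /* 1st likely cache miss */
  if (key == p->slots[place2])
    return 1;                                  /* 2nd likely cache miss */
  return excl_find (p->exceptions, key);       /* 3rd place, the exceptions */
}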

In most cases in TPC-H, the hash joins are selective; i.e., most items on the probe side find no match in the hash table.

So, quite often you have 3 cache misses to show that there is no hit. This is, at least in theory, quite bad.

There are Bloom filters in front of the hash table. The Bloom filter will prune most of the probes that would miss. A Bloom filter is an array of bits. Given a hash number, the Bloom filter will very efficiently tell you whether the entry is sure not to be in the hash table. If the Bloom filter says the entry can be in the hash table, you must look.

In the Virtuoso case, for each entry in the hash table, the Bloom filter has 8 bits. The Bloom check uses a field of the hash number to pick a 64-bit word from the Bloom filter. Then different fields of the hash number are used to set up to 4 bits in a 64-bit bit-mask. When building the hash table, the masks are OR-ed into the Bloom filter. When probing, before looking in the hash table, the system checks whether the bits corresponding to the hash number are all on in the appropriate word. If they are not, the hash lookup is sure to miss.
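For concreteness, the word pick and the mask construction can be sketched as follows. These are not the actual BF_WORD / BF_MASK definitions; the particular bit fields are assumptions chosen only to show the shape of the check.

/* Illustrative Bloom filter helpers, not the actual Virtuoso definitions.
   bf is an array of sz 64-bit words, 8 bits of filter per hash table entry.
   One field of the hash number picks the word; four other 6-bit fields pick
   up to 4 bit positions within it. */
#define BF_WORD(h, sz)  ((h) % (sz))
#define BF_BIT(h, k)    (1ULL << (((h) >> (16 + 6 * (k))) & 63))
#define BF_MASK(h)      (BF_BIT (h, 0) | BF_BIT (h, 1) | BF_BIT (h, 2) | BF_BIT (h, 3))

/* Build side: OR the mask into the chosen word. */
/*   bf[BF_WORD (h, sz)] |= BF_MASK (h); */

/* Probe side: all mask bits must be on for the hash lookup to be worth doing. */
/*   maybe_hit = (bf[BF_WORD (h, sz)] & BF_MASK (h)) == BF_MASK (h); */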

Most expositions of Bloom filters talk about setting two bits for every value. With two bits set, we found 8 bits per value to work best. More bits make a larger filter that misses the cache more; fewer bits cause too many collisions, and the Bloom filter produces too many false positives. A significant finding is that with 8 bits per value, setting 4 bits instead of 2 makes the filter twice as selective. This simple trick cuts the number of hash lookups for items that passed the Bloom filter in half in many selective hash joins. Examples are the many joins of lineitem with part or supplier where there is a condition on the smaller table.

Still, even with Bloom filters, a cuckoo hash will make too many cache misses.

So, enter linear hash. The idea is simple: The hash number picks a place in an array. Either the entry being sought is in the vicinity, or it is not in the hash table. If the vicinity is full of other entries, the entry can still be in an exception list.

With this and cuckoo alike, there are 3 different variants of hash table:

  1. A set of single unique integers

  2. A single-integer key with 0 or more dependent values, possibly with a next link if the key is not unique

  3. A key of n arbitrary values, 0 or more dependent values, optional next link if the key is not unique

In the first case, the hash table is an array of values; in the two other cases, it is an array of pointers. But since a pointer is 64 bits, of which the high 16 are not in the address space of x86_64, these high bits can be used to keep a part of the hash number. It will be necessary to dereference the pointer only if the high bits match the hash number. This means that nearly all lookups that do not find a match are handled with a single cache miss.

Each cache miss brings in a cache line of 8 words. The lookup starts at a point given by the hash number and wraps around at the end of the cache line. Only in the case that all 8 words are occupied but do not match does one need to look at the exceptions. There is one exception list for each partition of the hash table, like in the cuckoo scheme.
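A schematic version of the pointer-array lookup, combining the high-bit tag with the wrap-around inside the cache line, could look like this. The names, the entry layout, and exception_find are illustrative; partitioning is left out, and the slot count is assumed to be a multiple of 8.

/* Schematic linear-hash lookup for the pointer-array variants, not the actual
   code.  The hash number picks a slot; the probe covers the 8-word cache line
   containing it, wrapping at the line boundary; the high 16 bits of each
   stored pointer carry part of the hash number, so a slot is rejected without
   dereferencing unless the tags match. */
typedef struct hj_entry_s
{
  int64              key;
  int64              dep;   /* dependent value(s) */
  struct hj_entry_s *next;  /* next entry with the same key, if not unique */
} hj_entry_t;

hj_entry_t *
linear_get (uint64 *array, uint64 n_slots, uint64 h, int64 key)
{
  uint64 slot = h % n_slots;
  uint64 base = slot & ~7ULL;                  /* start of the 8-word cache line */
  uint64 tag = h >> 48;                        /* 16 hash bits kept in the pointer */
  int i;
  for (i = 0; i < 8; i++)
    {
      uint64 w = array[base + ((slot + i) & 7)];   /* wrap inside the line */
      if (0 == w)
        return NULL;                           /* empty slot: no match anywhere */
      if ((w >> 48) == tag)
        {
          hj_entry_t *e = (hj_entry_t *) (w & 0x0000FFFFFFFFFFFFULL);
          if (e->key == key)
            return e;                          /* dereference only on tag match */
        }
    }
  /* All 8 words occupied with no match: the entry can still be an exception. */
  return exception_find (h, key);
}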

A hash lookup is always done on a vector of keys; the loop that takes most of the time is in fact the Bloom filter check. It goes as follows:

/* Declare the locals for key n: hash number, Bloom filter word, bit mask. */
#define CHB_VAR(n)                           \
   uint64 h##n, w##n, mask##n;


/* Hash key i, fetch the matching 64-bit word of the Bloom filter bf,
   and build the mask of bits that must all be on for a possible hit. */
#define CHB_INIT(n, i)                       \
   MHASH_STEP_1 (h##n, i);                   \
   w##n = bf[BF_WORD (h##n, sz)];            \
   mask##n = BF_MASK (h##n);


/* Record the key's index as a candidate; advance the fill count only if
   all mask bits are set (predicated, no branch). */
#define CHB_CK(n)                            \
   { matches[mfill] = inx + n;               \
     mfill += (w##n & mask##n) == mask##n; }

       /* Check each key of the probe vector against the Bloom filter. */
       for (inx = inx; inx < last; inx ++)
         {
           CHB_VAR (0);
           CHB_INIT (0, REF ((ce_first + sizeof (ELT_T) * inx)));
           CHB_CK (0);
         }

This is the perfect loop for out-of-order execution. Now, I have tried every variation you can imagine, and this does not get better. The loop calculates a hash number, fetches the corresponding word from the Bloom filter, calculates a mask, stores the index of the key in a results array, and increments the results counter if all the bits were set. There is no control dependency anywhere, just a data dependency between successive iterations; i.e., to know where the result must go, you must know if the previous was a hit.

You can unroll this loop very easily, so, for example, take 4 keys, do the numbers, fetch the words, and then check them one after the other. One would think this would have more misses in flight at any one time, which it does. But it does not run any faster.
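For illustration, the 4-way unrolled variant can be sketched with the same macros; the tail loop for the last few keys is omitted.

/* Sketch of the 4-way unrolled variant: compute 4 hash numbers, fetch the 4
   Bloom words, then check them one after the other.  Tail handling omitted. */
       for (inx = inx; inx + 4 <= last; inx += 4)
         {
           CHB_VAR (0); CHB_VAR (1); CHB_VAR (2); CHB_VAR (3);
           CHB_INIT (0, REF ((ce_first + sizeof (ELT_T) * (inx + 0))));
           CHB_INIT (1, REF ((ce_first + sizeof (ELT_T) * (inx + 1))));
           CHB_INIT (2, REF ((ce_first + sizeof (ELT_T) * (inx + 2))));
           CHB_INIT (3, REF ((ce_first + sizeof (ELT_T) * (inx + 3))));
           CHB_CK (0); CHB_CK (1); CHB_CK (2); CHB_CK (3);
         }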

Maybe the unrolled loop is too long. Circumstantial evidence suggests that short loops are better for instruction prefetching. So, one can also make a loop that fetches any number of words of the Bloom filter and puts them in one local array, with the hash numbers in another array. A subsequent loop then reads the hash number, calculates the mask, and checks if there is a hit. In this way one can generate as many misses as one wants and check them as late as one wants. It so happens that doing 8 misses and then checking them is better than either 4 or 16. But 8 is still marginally worse than the loop first mentioned.
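A sketch of this two-pass form with a batch of 8, reusing the building blocks of the loop above (again with the tail handling omitted):

/* Sketch of the split loop: first issue the 8 Bloom word fetches, then read
   the hash numbers back and check the masks.  MHASH_STEP_1, BF_WORD, BF_MASK,
   REF and ELT_T are as in the loop shown earlier. */
       for (inx = inx; inx + 8 <= last; inx += 8)
         {
           uint64 hs[8], ws[8];
           int k;
           for (k = 0; k < 8; k++)
             {
               MHASH_STEP_1 (hs[k], REF ((ce_first + sizeof (ELT_T) * (inx + k))));
               ws[k] = bf[BF_WORD (hs[k], sz)];      /* get the misses in flight */
             }
           for (k = 0; k < 8; k++)
             {
               uint64 mask = BF_MASK (hs[k]);        /* check as late as one wants */
               matches[mfill] = inx + k;
               mfill += (ws[k] & mask) == mask;
             }
         }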

One can also vary the test. Instead of adding a truth value to the result counter, one can have

if ((word & mask) == mask) result[fill++] = inx;

There is no clear difference between predication (incrementing the fill by truth value) and a conditional jump. The theory of out-of-order execution would predict predication to be better, but the difference is lost in measurement noise. This is true on both Intel Nehalem (Xeon 55xx) and Sandy Bridge (E5 26xx), but could be different on other architectures.

The multicore scalability of the test will give some information about platform utilization.

This is the ultimately simplified selective hash join:

SELECT  COUNT (*) 
  FROM  lineitem, 
        part 
 WHERE  l_partkey = p_partkey 
   AND  p_size < 15
;

The next query is the second simplest hash join but misses the cache much more: since the hash table now has a key and a dependent part, there is an extra pointer to follow, and the hash entry is two words plus the pointer to them in the hash table array.

SELECT  SUM (p_retailprice) 
  FROM  lineitem, 
        part 
 WHERE  l_partkey = p_partkey 
   AND  p_size < 15
;

By adjusting the number of parts selected, we can vary the Bloom filter selectivity and the size of the hash table. Below, we show the times for the two queries with single-thread and 24-thread execution, with different percentages of the part table on the build side of the hash join. All runs are against warm 100G TPC-H on the same test system as in the rest of the TPC-H series (dual Xeon E5-2630).

This table compares the performance of the linear and cuckoo implementations on the above queries (COUNT vs. SUM) on either 24 threads or 1 thread. Four data points are given for different hash table sizes, expressed as the percentage of the part table (from 400K to 20M entries) that goes on the build side. The rightmost column, which represents the case where the entire part table is on the build side, does not have a Bloom filter; the other cases do. The Bloom settings are 8 bits per value with 4 bits set for linear, and 8 bits with 2 set for cuckoo. The times are all in milliseconds, and the thousands separator is a comma.

Hash type  Query type  Threads   2% (ms)  10% (ms)  30% (ms)  100% (ms)
Linear     COUNT            24     1,204     1,683     3,100      6,214
Linear     SUM              24     1,261     2,447     5,059     13,086
Linear     COUNT             1    15,286    22,451    38,863     66,722
Linear     SUM               1    17,575    33,664    81,927    179,013
Cuckoo     COUNT            24     1,849     2,840     4,105      6,203
Cuckoo     SUM              24     2,833     4,903     9,446     19,652
Cuckoo     COUNT             1    25,146    39,064    57,383     85,105
Cuckoo     SUM               1    33,647    67,089   121,989    240,941

We clearly see cache effects on the first two lines, where SUM and COUNT run in almost the same time on a small hash table but show a 2x difference on the larger hash table. The instruction path length is not very different for SUM and COUNT, but the memory footprint has a 3x difference.

We note that the SMP scalability of linear is slightly better, as seen by comparing the ratio of 24-thread SUM to single-thread SUM for each. Both ratios are over 12x, indicating a net benefit from core multithreading. (The test system has 12 physical cores.) The linear hash systematically outperforms cuckoo, understandably, since it makes fewer cache misses overall. The overall effect on the TPC-H score is noticeable, at around 15-20K units of composite score at 100G.

In conclusion, the Virtuoso hash join implementation is certainly on the level, with only small gains to be expected from further vectoring and prefetching. These results may be reproduced using the v7fasttrack Virtuoso Open Source releases from GitHub; develop/7 for cuckoo and feature/analytics for linear hash.

To be continued...
