This piece is dedicated to Peter Boncz, architect of Actian Vector and MonetDB.
Query optimization is hard. It is a set of mutually interacting tricks and special cases. Execution is also hard, but there the tricks do not interact quite as much or as unpredictably. So, if there is a few percent of score to be had from optimization of either execution or query, I will take execution first. It is less likely to break things and will probably benefit a larger set of use cases.
As we see from the profile in the previous article, hash join is the main piece of execution in TPCH. So between the article on late projection and the first result preview, I changed the hash table used in HASH JOIN
and GROUP BY
from cuckoo to linear.
Let's see how the hash tables work: Cuckoo hash is a scheme where an entry can be in one of two possible places in the table. If a new entry is inserted and either of the possible places is unoccupied, it goes there. If both are occupied, it could be that one contains an entry whose other possible location is free  and then that entry may be relocated. Thus an insert may push the previous occupant of the place somewhere else, which in turn may push another, and so on. It may happen that an insert is still not possible, in which case the entry to insert goes into an exceptions list.
To look up an entry, you get a hash number, and use different fields of it to pick the two places. Look in one, then the other, then the exceptions. If there is no match and the table is reasonably close to capacity, you will have looked in at least 3 widelyseparated places to determine the absence of a match. In practice, the hash table consists on a prime number of distinct arrays of a fixed size (partitions), and each partition has its own exception list. A modulo of the hash number picks the array, then two further modulos of different parts of the number pick the places in the array.
In most cases in TPCH, the hash joins are selective; i.e., most items on the probe side find no match in the hash table.
So, quite often you have 3 cache misses to show that there is no hit. This is, at least in theory, quite bad.
There are Bloom filters before the hash table. The Bloom filter will prune most of the probes that would miss. A Bloom filter is an array of bits. Given a hash number, the Bloom filter will very efficiently tell you whether the entry is sure not to be in the hash table. If the Bloom filter says it can be in the hash table, you must look.
In the Virtuoso case, for each entry in the hash table, the Bloom filter has 8 bits. The Bloom check uses a field of the hash number to pick a 64bit word from the Bloom filter. Then different fields of the hash number are used to set upto 4 bits in a 64bit bitmask. When building the hash table, the masks are ORed into the Bloom filter. When probing, before looking in the hash table, the system checks to see if the bits corresponding to the hash number are all on in the appropriate word. If they are not, the hash lookup is sure to miss.
Most expositions of Bloom filters talk about setting two bits for every value. With two bits set, we found 8 bitspervalue to work best. More bits makes a larger filter and misses the cache more; fewer bits makes too many collisions, and the Bloom filter produces too many false positives. A significant finding is that with 8 bitspervalue, setting 4 bits instead of 2 causes the filter to be twice as selective. The simple trick of setting 4 bits cuts the number of hash lookups for items that passed the Bloom filter to half in many selective hash joins. Examples are the many joins of lineitem
and part
or supplier
where there is a condition on the smaller table.
Still, even with Bloom filters, a cuckoo hash will make too many cache misses.
So, enter linear hash. The idea is simple: The hash number picks a place in an array. Either the entry being sought is in the vicinity, or it is not in the hash table. If the vicinity is full of other entries, the entry can still be in an exception list.
With this and cuckoo alike, there are 3 different variants of hash table:

A set of single unique integers

A singleinteger key with 0 or more dependent values, possibly with a next link if the key is not unique

A key of n arbitrary values, 0 or more dependent values, optional next link if the key is not unique
In the first case, the hash table is an array of values; in the two other cases, it is an array of pointers. But since a pointer is 64 bits, of which the high 16 are not in the address space of x86_64, these high bits can be used to keep a part of the hash number. It will be necessary to dereference the pointer only if the high bits match the hash number. This means that nearly all lookups that do not find a match are handled with a single cache miss.
Each cache miss brings in a cache line of 8 words. The lookup starts at a point given by the hash number and wraps around at the end of the cache line. Only in the case that all 8 words are occupied but do not match does one need to look at the exceptions. There is one exception list for each partition of the hash table, like in the cuckoo scheme.
A hash lookup is always done on a vector of keys; the loop that takes most of the time is in fact the Bloom filter check. It goes as follows:
#define CHB_VAR(n) \
uint64 h##n, w##n, mask##n;
#define CHB_INIT(n, i) \
MHASH_STEP_1 (h##n, i); \
w##n = bf[BF_WORD (h##n, sz)]; \
mask##n = BF_MASK (h##n);
#define CHB_CK(n) \
{ matches[mfill] = inx + n; \
mfill += (w##n & mask##n) == mask##n; }
for (inx = inx; inx < last; inx ++)
{
CHB_VAR (0);
CHB_INIT (0, REF ((ce_first + sizeof (ELT_T) * inx)));
CHB_CK (0);
}
This is the perfect loop for outoforder execution. Now, I have tried every variation you can imagine, and this does not get better. The loop calculates a hash number, fetches the corresponding word from the Bloom filter, calculates a mask, stores the index of the key in a results array, and increments the results counter if all the bits were set. There is no control dependency anywhere, just a data dependency between successive iterations; i.e., to know where the result must go, you must know if the previous was a hit.
You can unroll this loop very easily, so, for example, take 4 keys, do the numbers, fetch the words, and then check them one after the other. One would think this would have more misses in flight at any one time, which it does. But it does not run any faster.
Maybe the loop is too long. Circumstantial evidence suggests that short loops are better for instruction prefetching. So, one can also make a loop that gets any number of words of the Bloom filter and puts them in one local array and the hash numbers in another array. A subsequent loop then reads the hash number, calculates the mask, and checks if there is a hit. In this way one can generate as many misses as one wants and check them as late as one wants. It so happens that doing 8 misses and then checking them is better than either 4 or 16. But 8 is still marginally worse than the loop first mentioned.
One can also vary the test. Instead of adding a truth value to the result counter, one can have
if ((word & mask) == mask) result[fill++] = inx;
There is no clear difference between predication (incrementing the fill by truth value) and a conditional jump. The theory of outoforder execution would predict predication to be better, but the difference is lost in measurement noise. This is true on both Intel Nehalem (Xeon 55xx) and Sandy Bridge (E5 26xx), but could be different on other architectures.
The multicore scalability of the test will give some information about platform utilization.
This is the ultimately simplified selective hash join:
SELECT COUNT (*)
FROM lineitem,
part
WHERE l_partkey = p_partkey
AND p_size < 15
;
This is the second simplest hash join but misses the cache much more; since this now has a key and a dependent part in the hash table, there is an extra pointer to follow, and the hash entry is two words plus the pointer to these in the hash table array.
SELECT SUM (p_retailprice)
FROM lineitem,
part
WHERE l_partkey = p_partkey
AND p_size < 15
;
By adjusting the number of parts selected, we can vary the Bloom filter selectivity and the size of the hash table. Below, we show the times for the two queries with singlethread and 24thread execution, with different percentages of the part
table on the build side of the hash join. All runs are against warm 100G TPCH on the same test system as in the rest of the TPCH series (dual Xeon E52630).
This table compares the performance of the linear and cuckoo implementations on the above queries (count vs. sum) on either 24 threads or 1 thread. Four data points are given for different sizes of hash table, given as percentage of the part
table (having 400K  20M entries) in the hash table. The rightmost column, which represents the case where the entire part
table is on the build side does not have a Bloom filter; the other cases do. The Bloom bits are 8/4 for linear and 8/2 for cuckoo. The times are all in milliseconds, and the thousands separator is a comma.
Hash type 
Query type 
Threads 
2% (ms) 
10% (ms) 
30% (ms) 
100% (ms) 
Linear 
COUNT 
24 
1,204 
1,683 
3,100 
6,214 
Linear 
SUM 
24 
1,261 
2,447 
5,059 
13,086 
Linear 
COUNT 
1 
15,286 
22,451 
38,863 
66,722 
Linear 
SUM 
1 
17,575 
33,664 
81,927 
179,013 
Cuckoo 
COUNT 
24 
1,849 
2,840 
4,105 
6,203 
Cuckoo 
SUM 
24 
2,833 
4,903 
9,446 
19,652 
Cuckoo 
COUNT 
1 
25,146 
39,064 
57,383 
85,105 
Cuckoo 
SUM 
1 
33,647 
67,089 
121,989 
240,941 
We clearly see cache effects on the two first lines, where the SUM
and COUNT
run in almost the same time on a small hash table but have a 2x difference on the larger hash table. The instruction path length is not very different for SUM
and COUNT
, but the memory footprint has a 3x difference.
We note that the SMP scalability of linear is slightly better, contrasting the ratio of 24thread SUM
to singlethread SUM
. Both numbers are over 12x, indicating net benefit from core multithreading. (The test system has 12 physical cores.) The linear hash systematically outperforms cuckoo, understandably, since it makes a smaller net number of cache misses. The overall effect on the TPCH score is noticeable, at around 1520K units of composite score at 100G.
In conclusion, the Virtuoso hash join implementation is certainly on the level, with only small gains to be expected from further vectoring and prefetching. These results may be reproduced using the v7fasttrack Virtuoso Open Source releases from GitHub; develop/7 for cuckoo and feature/analytics for linear hash.
To be continued...
In Hoc Signo Vinces (TPCH) Series