TPC-H is a hash join game. The rules do allow indices, but maintaining them takes time, and indices quickly lead to non-local access patterns. Indices also take space. Besides, somebody must know which indices to create, which is not obvious. Thus, it is best if a BI data warehouse works without them.

Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So, the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense since only green ones are being looked for.

So, let's consider Q9. The query is:

SELECT                 nation,
                       o_year,
        SUM(amount) AS sum_profit
 FROM  ( SELECT
                                                                          n_name AS nation,
                                               EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
                 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
           FROM
                 part,
                 supplier,
                 lineitem,
                 partsupp,
                 orders,
                 nation
          WHERE  s_suppkey = l_suppkey
            AND  ps_suppkey = l_suppkey
            AND  ps_partkey = l_partkey
            AND  p_partkey = l_partkey
            AND  o_orderkey = l_orderkey
            AND  s_nationkey = n_nationkey
            AND  p_name like '%green%'
       ) AS profit
GROUP BY  nation,
          o_year
ORDER BY  nation,
          o_year DESC
;

The intent is to calculate the profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest-running queries in the set. The query is restricted by part only, and the condition selects 1/17 of all parts.

The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect frequent lookups, so building a hash table is faster than using an index. Partsupp is the 3rd largest table in the database. It has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows whose part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey in the query; the system must understand that this equality follows from ps_partkey = l_partkey and p_partkey = l_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.
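Conceptually, the build side of the partsupp hash corresponds to the following derived table. This is only a sketch of the restriction imported into the hash build, not literal optimizer output:

SELECT  ps_suppkey,
        ps_partkey,
        ps_supplycost
  FROM  partsupp,
        part
 WHERE  ps_partkey = p_partkey
   AND  p_name LIKE '%green%'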

Looking further into the plan, we note a scan of lineitem followed by a hash join with part. The build of the partsupp hash is already restricted by part, so this join restricts lineitem to exactly the same rows; hence part is used twice here even though it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with the two-part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part join first is faster, since it becomes a single-key, invisible hash join, merged into the lineitem scan and done before even accessing l_suppkey and the other columns.
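In SQL terms, the early filter on the lineitem scan is roughly the following; this is a sketch, not the actual internal form, with the column list reduced to what Q9 needs:

SELECT  l_orderkey,
        l_partkey,
        l_suppkey,
        l_extendedprice,
        l_discount,
        l_quantity
  FROM  lineitem,
        part
 WHERE  l_partkey = p_partkey
   AND  p_name LIKE '%green%'

Because the part hash holds only the qualifying p_partkey values, the probe can be evaluated against l_partkey alone, before the other lineitem columns are even fetched.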

{ 
time   3.9e-06% fanout         1 input         1 rows
time   4.7e-05% fanout         1 input         1 rows
{ hash filler
time   3.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, nation)
 
time   8.8e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
 
}
time      0.16% fanout         1 input         1 rows
{ hash filler
time     0.011% fanout     1e+06 input         1 rows
SUPPLIER     1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
 
time      0.03% fanout         0 input     1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
 
}
time      0.57% fanout         1 input         1 rows
{ hash filler
Subquery 58 
{ 
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(t1.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time       1.1% fanout         4 input 1.17076e+06 rows
PARTSUPP       3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
 inlined  PS_PARTKEY = t1.P_PARTKEY
 
After code:
      0: t4.PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: t4.PS_PARTKEY :=  := artm t4.PS_PARTKEY
      8: t1.P_PARTKEY :=  := artm t1.P_PARTKEY
      12: t4.PS_SUPPLYCOST :=  := artm t4.PS_SUPPLYCOST
      16: BReturn 0
time      0.33% fanout         0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
 
}
}
time      0.18% fanout         1 input         1 rows
{ hash filler
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time     0.017% fanout         0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time   5.1e-06% fanout         1 input         1 rows
{ fork
time   4.1e-06% fanout         1 input         1 rows
{ fork
time        59% fanout 3.51125e+07 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
 
hash partition+bloom by 108 (tmp)hash join merged always card     0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card         1 -> (.S_NATIONKEY)
time      0.18% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
Hash source 101 merged into ts      0.058 rows(.L_PARTKEY) -> ()
time        17% fanout         1 input 3.51125e+07 rows
Hash source 82       0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> (  <none> , .PS_SUPPLYCOST)
time       6.2% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
      4: temp := artm temp - temp
      8: BReturn 0
ORDERS unq         1 rows (.O_ORDERDATE)
 inlined  O_ORDERKEY = k_.L_ORDERKEY
time    0.0055% fanout         1 input 3.51125e+07 rows
Hash source 49 merged into ts          1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time       3.5% fanout         1 input 3.51125e+07 rows
Hash source 35           1 rows(k_.S_NATIONKEY) -> (nation)
time       8.8% fanout         0 input 3.51125e+07 rows
 
Precode:
      0: o_year := Call year (.O_ORDERDATE)
      5: BReturn 0
Sort (nation, o_year) -> (temp)
 
}
time   4.7e-05% fanout       175 input         1 rows
group by read node  
(nation, o_year, sum_profit)
time   0.00028% fanout         0 input       175 rows
Sort (nation, o_year) -> (sum_profit)
 
}
time   2.2e-05% fanout       175 input         1 rows
Key from temp (nation, o_year, sum_profit)
 
time   1.6e-06% fanout         0 input       175 rows
Select (nation, o_year, sum_profit)
}


 6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq   99.6068% same seg  0.357328% same pg 

6.1s is a good score for this query. When the same query is executed in 5 parallel invocations, the fastest finishes in 13.7s and the slowest in 27.6s. For the five concurrent executions, the peak transient memory utilization for the hash tables is 4.7 GB, which is very reasonable.

*           *           *           *           *

Let us next consider Q17.

SELECT
        SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
        lineitem,
        part
WHERE
        p_partkey = l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l_quantity 
           < (
                   SELECT
                            2e-1 * AVG(l_quantity)
                   FROM
                            lineitem
                   WHERE
                            l_partkey = p_partkey
                )

Deceptively simple? This calculates the total value of small orders (below 1/5 of average quantity for the part) for all parts of a given brand with a specific container.

If there is an index on l_partkey, the plan is easy enough: Take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.
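For reference, the index-based plan presupposes a secondary index on l_partkey, something like the following; the index name is arbitrary and the DDL is only a sketch:

CREATE INDEX l_partkey_idx ON lineitem (l_partkey);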

If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and building it would take so long that the race would be lost on this point alone. The trick is to import the part restriction into the hash build, via the join on l_partkey. This gives a plan that scans lineitem twice, each time with a very selective hash join against the few qualifying parts. For each lineitem that joins with such a part, the average quantity of the part is then looked up, so the average is potentially calculated several times for the same part.
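A sketch of the restricted hash build, with the brand and container predicates imported via the l_partkey = p_partkey join:

SELECT  l_partkey,
        l_quantity
  FROM  lineitem,
        part
 WHERE  l_partkey = p_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'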

The below plan is workable, but better is possible: We notice that the very selective join needs to be done just once; it is cheaper to remember the result than to compute it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as

SELECT
        ... 
  FROM
        lineitem, 
        part, 
        ( SELECT 
                                           l_partkey, 
                 0.2 * AVG (l_quantity) AS qty 
            FROM 
                 lineitem, 
                 part 
            ...
        ) f 
 WHERE
        l_partkey = f.l_partkey 
 ...

In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
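Spelled out, one complete form of this rewrite could look as follows. This is a sketch that fills in the parts elided above; in particular, the GROUP BY on l_partkey inside f and the alias l on the outer lineitem are assumed here for disambiguation:

SELECT  SUM(l.l_extendedprice) / 7.0 AS avg_yearly
  FROM  lineitem l,
        part,
        ( SELECT  l_partkey,
                  0.2 * AVG (l_quantity) AS qty
            FROM  lineitem,
                  part
           WHERE  l_partkey = p_partkey
             AND  p_brand = 'Brand#23'
             AND  p_container = 'MED BOX'
           GROUP BY  l_partkey
        ) f
 WHERE  p_partkey = l.l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l.l_partkey = f.l_partkey
   AND  l.l_quantity < f.qty

With f on the build side of a hash join, each part's average is computed once during the build and only probed thereafter.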

{ 
time   7.9e-06% fanout         1 input         1 rows
time    0.0031% fanout         1 input         1 rows
{ hash filler
time      0.27% fanout     20031 input         1 rows
PART     2e+04 rows(.P_PARTKEY)
 P_BRAND =  <c Brand#23> ,  P_CONTAINER =  <c MED BOX>
time   0.00047% fanout         0 input     20031 rows
Sort hf 34 (.P_PARTKEY)
}
time       0.1% fanout         1 input         1 rows
{ hash filler
Subquery 40 
{ 
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
 
hash partition+bloom by 38 (tmp)hash join merged always card     0.001 -> ()
time    0.0042% fanout         1 input    600982 rows
Hash source 34 merged into ts not partitionable     0.001 rows(t4.L_PARTKEY) -> ()
 
After code:
      0: t4.L_PARTKEY :=  := artm t4.L_PARTKEY
      4: t4.L_QUANTITY :=  := artm t4.L_QUANTITY
      8: BReturn 0
time     0.059% fanout         0 input    600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
 
}
}
time   6.8e-05% fanout         1 input         1 rows
{ fork
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
 
hash partition+bloom by 38 (tmp)hash join merged always card   0.00052 -> ()
time   0.00021% fanout         1 input    600982 rows
Hash source 34 merged into ts    0.00052 rows(.L_PARTKEY) -> ()
 
Precode:
      0: .P_PARTKEY :=  := artm .L_PARTKEY
      4: BReturn 0
END Node
After test:
      0: { 
time     0.038% fanout         1 input    600982 rows
time      0.17% fanout         1 input    600982 rows
{ fork
time       6.8% fanout         0 input    600982 rows
Hash source 62  not partitionable      0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
 
After code:
      0:  sum sum.L_QUANTITYset no set_ctr
      5:  sum count 1 set no set_ctr
      10: BReturn 0
}
 
After code:
      0: temp := artm sum / count
      4: temp := artm  0.2  * temp
      8: aggregate :=  := artm temp
      12: BReturn 0
time     0.042% fanout         0 input    600982 rows
Subquery Select(aggregate)
}
 
      8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
 
After code:
      0:  sum sum.L_EXTENDEDPRICE
      5: BReturn 0
}
 
After code:
      0: avg_yearly := artm sum /  7 
      4: BReturn 0
time   4.6e-06% fanout         0 input         1 rows
Select (avg_yearly)
}


 2695 msec 1996% cpu,         3 rnd 1.18242e+09 seq         0% same seg         0% same pg 

2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.

*           *           *           *           *

Q20 is not very long-running, but it is maybe the hardest to optimize of the lot. But as usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.

SELECT TOP 100
           s_name,
           s_address
     FROM
           supplier,
           nation
    WHERE
           s_suppkey IN 
             ( SELECT  
                       ps_suppkey
                 FROM  
                       partsupp
                WHERE  
                       ps_partkey IN 
                         ( SELECT  
                                   p_partkey
                             FROM  
                                   part
                            WHERE  
                                   p_name LIKE 'forest%'
                         )
                  AND  ps_availqty > 
                         ( SELECT  
                                   0.5 * SUM(l_quantity)
                             FROM  
                                   lineitem
                            WHERE  
                                   l_partkey = ps_partkey
                              AND  l_suppkey = ps_suppkey
                              AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                              AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                         )
             )
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name

This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.

The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS which in turn can become a join as follows:

SELECT 
        l_suppkey 
FROM
        lineitem 
WHERE
        l_partkey IN  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_name LIKE 'forest%'
       )
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem 
 WHERE  EXISTS  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_partkey = l_partkey 
               AND  p_name LIKE 'forest%')
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        ( SELECT  
                    DISTINCT p_partkey 
            FROM  
                    part 
           WHERE  
                    p_name LIKE 'forest%') f 
 WHERE  
        l_partkey = f.p_partkey
;

But since p_partkey is unique, the DISTINCT drops off and we have:

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        part 
 WHERE  
        p_name LIKE 'forest%'
   AND  l_partkey = p_partkey
;

The innermost IN, the one on ps_partkey, goes through exactly these transformations and simply becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.
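Spelled out, the outer IN then behaves like a join with a distinct derived table. A sketch of the resulting shape, keeping the inner restrictions exactly as in the original query:

SELECT TOP 100
        s_name,
        s_address
  FROM  supplier,
        nation,
        ( SELECT DISTINCT ps_suppkey
            FROM  partsupp,
                  part
           WHERE  ps_partkey = p_partkey
             AND  p_name LIKE 'forest%'
             AND  ps_availqty >
                    ( SELECT  0.5 * SUM (l_quantity)
                        FROM  lineitem
                       WHERE  l_partkey = ps_partkey
                         AND  l_suppkey = ps_suppkey
                         AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                         AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                    )
        ) f
 WHERE  s_suppkey = f.ps_suppkey
   AND  s_nationkey = n_nationkey
   AND  n_name = 'CANADA'
 ORDER BY  s_name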

The derived table is flattened and the DISTINCT is done partitioned; hence the stage node in front of the distinct. A DISTINCT can be multithreaded if each thread gets a specific subset of all the keys; the stage node is an exchange of tuples between the threads. Each thread then does a TOP k sort. The TOP k trick we saw in Q18 is used, but does not contribute much here.

{ 
time   8.2e-06% fanout         1 input         1 rows
time   0.00017% fanout         1 input         1 rows
{ hash filler
time   6.1e-05% fanout         1 input         1 rows
NATION         1 rows(.N_NATIONKEY)
 N_NAME =  <c CANADA>
time   1.2e-05% fanout         0 input         1 rows
Sort hf 34 (.N_NATIONKEY)
}
time     0.073% fanout         1 input         1 rows
{ hash filler
time       4.1% fanout    240672 input         1 rows
PART   2.4e+05 rows(t74.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time     0.011% fanout         0 input    240672 rows
Sort hf 47 (t74.P_PARTKEY)
}
time      0.69% fanout         1 input         1 rows
{ hash filler
Subquery 56 
{ 
time        42% fanout 1.09657e+06 input         1 rows
LINEITEM   9.1e+07 rows(t76.L_PARTKEY, t76.L_SUPPKEY, t76.L_QUANTITY)
 L_SHIPDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 54 (tmp)hash join merged always card     0.012 -> ()
time     0.022% fanout         1 input 1.09657e+06 rows
Hash source 47 merged into ts not partitionable     0.012 rows(t76.L_PARTKEY) -> ()
 
After code:
      0: t76.L_PARTKEY :=  := artm t76.L_PARTKEY
      4: t76.L_SUPPKEY :=  := artm t76.L_SUPPKEY
      8: t76.L_QUANTITY :=  := artm t76.L_QUANTITY
      12: BReturn 0
time      0.22% fanout         0 input 1.09657e+06 rows
Sort hf 80 (t76.L_PARTKEY, t76.L_SUPPKEY) -> (t76.L_QUANTITY)
 
}
}
time   2.1e-05% fanout         1 input         1 rows
time   3.2e-05% fanout         1 input         1 rows
{ fork
time       5.3% fanout    240672 input         1 rows
PART   2.4e+05 rows(t6.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time       1.9% fanout         4 input    240672 rows
PARTSUPP       1.2 rows(t4.PS_AVAILQTY, t4.PS_PARTKEY, t4.PS_SUPPKEY)
 inlined  PS_PARTKEY = t6.P_PARTKEY
time        16% fanout  0.680447 input    962688 rows
END Node
After test:
      0: { 
time      0.08% fanout         1 input    962688 rows
time       9.4% fanout         1 input    962688 rows
{ fork
time       3.6% fanout         0 input    962688 rows
Hash source 80       0.013 rows(k_t4.PS_PARTKEY, k_t4.PS_SUPPKEY) -> (t8.L_QUANTITY)
 
After code:
      0:  sum sumt8.L_QUANTITYset no set_ctr
      5: BReturn 0
}
 
After code:
      0: temp := artm  0.5  * sum
      4: aggregate :=  := artm temp
      8: BReturn 0
time      0.85% fanout         0 input    962688 rows
Subquery Select(aggregate)
}
 
      8: if (t4.PS_AVAILQTY > scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
time         1% fanout         1 input    655058 rows
Stage 2
time     0.071% fanout         1 input    655058 rows
Distinct (q_t4.PS_SUPPKEY)
 
After code:
      0: PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: BReturn 0
time     0.016% fanout         1 input    655058 rows
Subquery Select(PS_SUPPKEY)
time       3.2% fanout 0.0112845 input    655058 rows
SUPPLIER unq     0.075 rows (.S_NAME, .S_NATIONKEY, .S_ADDRESS)
 inlined  S_SUPPKEY = PS_SUPPKEY
hash partition+bloom by 38 (tmp)hash join merged always card      0.04 -> ()
top k on S_NAME
time    0.0012% fanout         1 input      7392 rows
Hash source 34 merged into ts       0.04 rows(.S_NATIONKEY) -> ()
time     0.074% fanout         0 input      7392 rows
Sort (.S_NAME) -> (.S_ADDRESS)
 
}
time   0.00013% fanout       100 input         1 rows
top order by read (.S_NAME, .S_ADDRESS)
time     5e-06% fanout         0 input       100 rows
Select (.S_NAME, .S_ADDRESS)
}


 1777 msec 1355% cpu,    894483 rnd 6.39422e+08 seq   79.1214% same seg   19.3093% same pg 

1.8s is sufficient, and in the same ballpark as VectorWise. Some further gain is possible, as the lineitem hash table can also be restricted by supplier; after all, only 1/25 of all suppliers are considered in the end. Further simplifications are possible, and another 20% of the time could be saved. The tricks are however quite complex and specific, and there are easier gains to be had -- for example, in reusing intermediates in Q17 and Q15.
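A sketch of such a further-restricted build side, assuming the supplier and nation restrictions can be imported into the hash build in the same way as the part restriction was:

SELECT  l_partkey,
        l_suppkey,
        l_quantity
  FROM  lineitem,
        part,
        supplier,
        nation
 WHERE  l_partkey = p_partkey
   AND  p_name LIKE 'forest%'
   AND  l_suppkey = s_suppkey
   AND  s_nationkey = n_nationkey
   AND  n_name = 'CANADA'
   AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
   AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))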

The next installment will discuss late projection and some miscellaneous tricks not mentioned so far. After this, we are ready to take an initial look at the performance of the system as a whole.

To be continued...
