TPC-H is a hash join game. The rules do allow indices, but maintaining these takes time, and indices will quickly result in non-local access patterns. Indices also take space. Besides, somebody must know what indices to create, which is not obvious. Thus, it is best if a BI data warehouse works without.
Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So, the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense, since only green ones are being looked for.
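As a minimal sketch of the idea (illustrative only, written against the standard TPC-H schema; it is not one of the benchmark queries), the LIKE condition belongs on the build side, so that only green parts are hashed and the lineitem scan probes a small table:
SELECT COUNT (*)
FROM lineitem,
     ( SELECT p_partkey
         FROM part
        WHERE p_name LIKE '%green%' ) gp   -- build side: green parts only
WHERE l_partkey = gp.p_partkey
;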
So, let's consider Q9. The query is:
SELECT nation,
o_year,
SUM(amount) AS sum_profit
FROM ( SELECT
n_name AS nation,
EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
FROM
part,
supplier,
lineitem,
partsupp,
orders,
nation
WHERE s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_orderkey = l_orderkey
AND s_nationkey = n_nationkey
AND p_name like '%green%'
) AS profit
GROUP BY nation,
o_year
ORDER BY nation,
o_year DESC
;
The intent is to calculate profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest running of the queries. The query is restricted by part only, and the condition selects 1/17 of all parts.
The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect to do frequent lookups, thus making a hash is faster than using the index. Partsupp is the 3rd largest table in the database. This has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows where the part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey, but rather the system must understand that this follows from l_partkey = ps_partkey and l_partkey = p_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.
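Importing the join amounts to building the hash from a derived table like the following (a sketch for illustration; in the plan below, this is the hash filler labeled Subquery 58):
SELECT ps_suppkey, ps_partkey, p_partkey, ps_supplycost   -- hashed on (ps_suppkey, ps_partkey), probed with (l_suppkey, l_partkey)
FROM partsupp, part
WHERE ps_partkey = p_partkey        -- restriction on part imported via the transitive equality
  AND p_name LIKE '%green%'
;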
Looking further into the plan, we note a scan of lineitem followed by a hash join with part. Restricting the build of the partsupp hash would have the same effect, hence part is here used twice while it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with a 2-part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part first is faster since this becomes a single-key, invisible hash join, merged into the lineitem scan, done before even accessing the l_suppkey and other columns.
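In effect, the plan evaluates the profit derived table as if part had been written into it twice, roughly as below (a sketch; the optimizer does this internally rather than by rewriting the SQL text):
SELECT n_name AS nation,
       EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
       l_extendedprice * (1 - l_discount) - ps.ps_supplycost * l_quantity AS amount
FROM lineitem, part, orders, supplier, nation,
     ( SELECT ps_partkey, ps_suppkey, ps_supplycost
         FROM partsupp, part
        WHERE ps_partkey = p_partkey
          AND p_name LIKE '%green%' ) ps    -- restricted partsupp hash build
WHERE l_partkey = p_partkey                 -- selective single-key hash join, merged into the lineitem scan
  AND p_name LIKE '%green%'
  AND l_partkey = ps.ps_partkey             -- 2-part-key hash join with the restricted build
  AND l_suppkey = ps.ps_suppkey
  AND o_orderkey = l_orderkey
  AND s_suppkey = l_suppkey
  AND s_nationkey = n_nationkey
;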
{
time 3.9e-06% fanout 1 input 1 rows
time 4.7e-05% fanout 1 input 1 rows
{ hash filler
time 3.6e-05% fanout 25 input 1 rows
NATION 25 rows(.N_NATIONKEY, nation)
time 8.8e-06% fanout 0 input 25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
}
time 0.16% fanout 1 input 1 rows
{ hash filler
time 0.011% fanout 1e+06 input 1 rows
SUPPLIER 1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
time 0.03% fanout 0 input 1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
}
time 0.57% fanout 1 input 1 rows
{ hash filler
Subquery 58
{
time 1.6% fanout 1.17076e+06 input 1 rows
PART 1.2e+06 rows(t1.P_PARTKEY)
P_NAME LIKE <c %green%> LIKE <c >
time 1.1% fanout 4 input 1.17076e+06 rows
PARTSUPP 3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
inlined PS_PARTKEY = t1.P_PARTKEY
After code:
0: t4.PS_SUPPKEY := := artm t4.PS_SUPPKEY
4: t4.PS_PARTKEY := := artm t4.PS_PARTKEY
8: t1.P_PARTKEY := := artm t1.P_PARTKEY
12: t4.PS_SUPPLYCOST := := artm t4.PS_SUPPLYCOST
16: BReturn 0
time 0.33% fanout 0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
}
}
time 0.18% fanout 1 input 1 rows
{ hash filler
time 1.6% fanout 1.17076e+06 input 1 rows
PART 1.2e+06 rows(.P_PARTKEY)
P_NAME LIKE <c %green%> LIKE <c >
time 0.017% fanout 0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time 5.1e-06% fanout 1 input 1 rows
{ fork
time 4.1e-06% fanout 1 input 1 rows
{ fork
time 59% fanout 3.51125e+07 input 1 rows
LINEITEM 6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
hash partition+bloom by 108 (tmp)hash join merged always card 0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card 1 -> (.S_NATIONKEY)
time 0.18% fanout 1 input 3.51125e+07 rows
Precode:
0: temp := artm 1 - .L_DISCOUNT
4: temp := artm .L_EXTENDEDPRICE * temp
8: BReturn 0
Hash source 101 merged into ts 0.058 rows(.L_PARTKEY) -> ()
time 17% fanout 1 input 3.51125e+07 rows
Hash source 82 0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> ( <none> , .PS_SUPPLYCOST)
time 6.2% fanout 1 input 3.51125e+07 rows
Precode:
0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
4: temp := artm temp - temp
8: BReturn 0
ORDERS unq 1 rows (.O_ORDERDATE)
inlined O_ORDERKEY = k_.L_ORDERKEY
time 0.0055% fanout 1 input 3.51125e+07 rows
Hash source 49 merged into ts 1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time 3.5% fanout 1 input 3.51125e+07 rows
Hash source 35 1 rows(k_.S_NATIONKEY) -> (nation)
time 8.8% fanout 0 input 3.51125e+07 rows
Precode:
0: o_year := Call year (.O_ORDERDATE)
5: BReturn 0
Sort (nation, o_year) -> (temp)
}
time 4.7e-05% fanout 175 input 1 rows
group by read node
(nation, o_year, sum_profit)
time 0.00028% fanout 0 input 175 rows
Sort (nation, o_year) -> (sum_profit)
}
time 2.2e-05% fanout 175 input 1 rows
Key from temp (nation, o_year, sum_profit)
time 1.6e-06% fanout 0 input 175 rows
Select (nation, o_year, sum_profit)
}
6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq 99.6068% same seg 0.357328% same pg
6.1s is a good score for this query. When executing the same in 5 parallel invocations, the fastest ends in 13.7s and the slowest in 27.6s. For five concurrent executions, the peak transient memory utilization is 4.7 GB for the hash tables, which is very reasonable.
* * * * *
Let us next consider Q17.
SELECT
SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
lineitem,
part
WHERE
p_partkey = l_partkey
AND p_brand = 'Brand#23'
AND p_container = 'MED BOX'
AND l_quantity
< (
SELECT
2e-1 * AVG(l_quantity)
FROM
lineitem
WHERE
l_partkey = p_partkey
)
Deceptively simple? This calculates the total value of small orders (below 1/5 of average quantity for the part) for all parts of a given brand with a specific container.
If there is an index on l_partkey, the plan is easy enough: Take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.
If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and this would take so long the race would be automatically lost on this point alone. The trick is to import the restriction on l_partkey into the hash build. This gives us a plan that does a scan of lineitem twice, doing a very selective hash join (few parts). There is a lookup for the average for each lineitem with the part. The average is calculated potentially several times.
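The restricted build side amounts to the derived table below (a sketch; it corresponds to the hash filler subquery in the plan that follows):
SELECT l_partkey, l_quantity    -- quantities only for lineitems of the roughly 20,000 qualifying parts
FROM lineitem, part
WHERE l_partkey = p_partkey
  AND p_brand = 'Brand#23'
  AND p_container = 'MED BOX'
;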
The below plan is workable, but better is possible: we notice that the very selective join need only be done once; it is cheaper to remember the result than to do it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as
SELECT
...
FROM
lineitem,
part,
( SELECT
l_partkey,
0.2 * AVG (l_quantity) AS qty
FROM
lineitem,
part
...
) f
WHERE
l_partkey = f.l_partkey
...
In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
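Spelled out, the decorrelated form could read roughly as follows (a sketch of one possible formulation; the plan below is for the original query, without this rewrite):
SELECT
 SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
 lineitem,
 part,
 ( SELECT
    l_partkey,
    0.2 * AVG (l_quantity) AS qty
   FROM
    lineitem,
    part
   WHERE
    l_partkey = p_partkey
    AND p_brand = 'Brand#23'
    AND p_container = 'MED BOX'
   GROUP BY l_partkey
 ) f
WHERE
 p_partkey = l_partkey
 AND p_brand = 'Brand#23'
 AND p_container = 'MED BOX'
 AND l_partkey = f.l_partkey
 AND l_quantity < f.qty
;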
{
time 7.9e-06% fanout 1 input 1 rows
time 0.0031% fanout 1 input 1 rows
{ hash filler
time 0.27% fanout 20031 input 1 rows
PART 2e+04 rows(.P_PARTKEY)
P_BRAND = <c Brand#23> , P_CONTAINER = <c MED BOX>
time 0.00047% fanout 0 input 20031 rows
Sort hf 34 (.P_PARTKEY)
}
time 0.1% fanout 1 input 1 rows
{ hash filler
Subquery 40
{
time 46% fanout 600982 input 1 rows
LINEITEM 6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
hash partition+bloom by 38 (tmp)hash join merged always card 0.001 -> ()
time 0.0042% fanout 1 input 600982 rows
Hash source 34 merged into ts not partitionable 0.001 rows(t4.L_PARTKEY) -> ()
After code:
0: t4.L_PARTKEY := := artm t4.L_PARTKEY
4: t4.L_QUANTITY := := artm t4.L_QUANTITY
8: BReturn 0
time 0.059% fanout 0 input 600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
}
}
time 6.8e-05% fanout 1 input 1 rows
{ fork
time 46% fanout 600982 input 1 rows
LINEITEM 6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
hash partition+bloom by 38 (tmp)hash join merged always card 0.00052 -> ()
time 0.00021% fanout 1 input 600982 rows
Hash source 34 merged into ts 0.00052 rows(.L_PARTKEY) -> ()
Precode:
0: .P_PARTKEY := := artm .L_PARTKEY
4: BReturn 0
END Node
After test:
0: {
time 0.038% fanout 1 input 600982 rows
time 0.17% fanout 1 input 600982 rows
{ fork
time 6.8% fanout 0 input 600982 rows
Hash source 62 not partitionable 0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
After code:
0: sum sum.L_QUANTITYset no set_ctr
5: sum count 1 set no set_ctr
10: BReturn 0
}
After code:
0: temp := artm sum / count
4: temp := artm 0.2 * temp
8: aggregate := := artm temp
12: BReturn 0
time 0.042% fanout 0 input 600982 rows
Subquery Select(aggregate)
}
8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
12: BReturn 1
13: BReturn 0
After code:
0: sum sum.L_EXTENDEDPRICE
5: BReturn 0
}
After code:
0: avg_yearly := artm sum / 7
4: BReturn 0
time 4.6e-06% fanout 0 input 1 rows
Select (avg_yearly)
}
2695 msec 1996% cpu, 3 rnd 1.18242e+09 seq 0% same seg 0% same pg
2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.
* * * * *
Q20 is not very long-running, but it is maybe the hardest to optimize of the lot. But as usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.
SELECT TOP 100
s_name,
s_address
FROM
supplier,
nation
WHERE
s_suppkey IN
( SELECT
ps_suppkey
FROM
partsupp
WHERE
ps_partkey IN
( SELECT
p_partkey
FROM
part
WHERE
p_name LIKE 'forest%'
)
AND ps_availqty >
( SELECT
0.5 * SUM(l_quantity)
FROM
lineitem
WHERE
l_partkey = ps_partkey
AND l_suppkey = ps_suppkey
AND l_shipdate >= CAST ('1994-01-01' AS DATE)
AND l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
)
)
AND s_nationkey = n_nationkey
AND n_name = 'CANADA'
ORDER BY s_name
This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.
The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS which in turn can become a join as follows:
SELECT
l_suppkey
FROM
lineitem
WHERE
l_partkey IN
( SELECT
p_partkey
FROM
part
WHERE
p_name LIKE 'forest%'
)
;
-- is --
SELECT
l_suppkey
FROM
lineitem
WHERE EXISTS
( SELECT
p_partkey
FROM
part
WHERE
p_partkey = l_partkey
AND p_name LIKE 'forest%')
;
-- is --
SELECT
l_suppkey
FROM
lineitem,
( SELECT
DISTINCT p_partkey
FROM
part
WHERE
p_name LIKE 'forest%') f
WHERE
l_partkey = f.p_partkey
;
But since p_partkey is unique, the DISTINCT drops off and we have:
SELECT
 l_suppkey
FROM
 lineitem,
 part
WHERE
 p_name LIKE 'forest%'
 AND l_partkey = p_partkey
;
You see, the innermost IN with the ps_partkey goes through all these changes, and just becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.
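Put together, the rewritten Q20 has roughly the following shape (a sketch of the flattened form, not the literal internal representation):
SELECT TOP 100
 s_name,
 s_address
FROM
 supplier,
 nation,
 ( SELECT DISTINCT
    ps_suppkey
   FROM
    partsupp,
    part
   WHERE
    ps_partkey = p_partkey
    AND p_name LIKE 'forest%'
    AND ps_availqty >
    ( SELECT
       0.5 * SUM(l_quantity)
      FROM
       lineitem
      WHERE
       l_partkey = ps_partkey
       AND l_suppkey = ps_suppkey
       AND l_shipdate >= CAST ('1994-01-01' AS DATE)
       AND l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
    )
 ) q
WHERE
 s_suppkey = q.ps_suppkey
 AND s_nationkey = n_nationkey
 AND n_name = 'CANADA'
ORDER BY s_name
;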
The derived table is flattened and the DISTINCT is done partitioned; hence the stage node in front of the distinct. A DISTINCT can be multithreaded, if each thread gets a specific subset of all the keys. The stage node is an exchange of tuples between several threads. Each thread then does a TOP k sort. The TOP k trick we saw in Q18 is used, but does not contribute much here.
{
time 8.2e-06% fanout 1 input 1 rows
time 0.00017% fanout 1 input 1 rows
{ hash filler
time 6.1e-05% fanout 1 input 1 rows
NATION 1 rows(.N_NATIONKEY)
N_NAME = <c CANADA>
time 1.2e-05% fanout 0 input 1 rows
Sort hf 34 (.N_NATIONKEY)
}
time 0.073% fanout 1 input 1 rows
{ hash filler
time 4.1% fanout 240672 input 1 rows
PART 2.4e+05 rows(t74.P_PARTKEY)
P_NAME LIKE <c forest%> LIKE <c >
time 0.011% fanout 0 input 240672 rows
Sort hf 47 (t74.P_PARTKEY)
}
time 0.69% fanout 1 input 1 rows
{ hash filler
Subquery 56
{
time 42% fanout 1.09657e+06 input 1 rows
LINEITEM 9.1e+07 rows(t76.L_PARTKEY, t76.L_SUPPKEY, t76.L_QUANTITY)
L_SHIPDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 54 (tmp)hash join merged always card 0.012 -> ()
time 0.022% fanout 1 input 1.09657e+06 rows
Hash source 47 merged into ts not partitionable 0.012 rows(t76.L_PARTKEY) -> ()
After code:
0: t76.L_PARTKEY := := artm t76.L_PARTKEY
4: t76.L_SUPPKEY := := artm t76.L_SUPPKEY
8: t76.L_QUANTITY := := artm t76.L_QUANTITY
12: BReturn 0
time 0.22% fanout 0 input 1.09657e+06 rows
Sort hf 80 (t76.L_PARTKEY, t76.L_SUPPKEY) -> (t76.L_QUANTITY)
}
}
time 2.1e-05% fanout 1 input 1 rows
time 3.2e-05% fanout 1 input 1 rows
{ fork
time 5.3% fanout 240672 input 1 rows
PART 2.4e+05 rows(t6.P_PARTKEY)
P_NAME LIKE <c forest%> LIKE <c >
time 1.9% fanout 4 input 240672 rows
PARTSUPP 1.2 rows(t4.PS_AVAILQTY, t4.PS_PARTKEY, t4.PS_SUPPKEY)
inlined PS_PARTKEY = t6.P_PARTKEY
time 16% fanout 0.680447 input 962688 rows
END Node
After test:
0: {
time 0.08% fanout 1 input 962688 rows
time 9.4% fanout 1 input 962688 rows
{ fork
time 3.6% fanout 0 input 962688 rows
Hash source 80 0.013 rows(k_t4.PS_PARTKEY, k_t4.PS_SUPPKEY) -> (t8.L_QUANTITY)
After code:
0: sum sumt8.L_QUANTITYset no set_ctr
5: BReturn 0
}
After code:
0: temp := artm 0.5 * sum
4: aggregate := := artm temp
8: BReturn 0
time 0.85% fanout 0 input 962688 rows
Subquery Select(aggregate)
}
8: if (t4.PS_AVAILQTY > scalar) then 12 else 13 unkn 13
12: BReturn 1
13: BReturn 0
time 1% fanout 1 input 655058 rows
Stage 2
time 0.071% fanout 1 input 655058 rows
Distinct (q_t4.PS_SUPPKEY)
After code:
0: PS_SUPPKEY := := artm t4.PS_SUPPKEY
4: BReturn 0
time 0.016% fanout 1 input 655058 rows
Subquery Select(PS_SUPPKEY)
time 3.2% fanout 0.0112845 input 655058 rows
SUPPLIER unq 0.075 rows (.S_NAME, .S_NATIONKEY, .S_ADDRESS)
inlined S_SUPPKEY = PS_SUPPKEY
hash partition+bloom by 38 (tmp)hash join merged always card 0.04 -> ()
top k on S_NAME
time 0.0012% fanout 1 input 7392 rows
Hash source 34 merged into ts 0.04 rows(.S_NATIONKEY) -> ()
time 0.074% fanout 0 input 7392 rows
Sort (.S_NAME) -> (.S_ADDRESS)
}
time 0.00013% fanout 100 input 1 rows
top order by read (.S_NAME, .S_ADDRESS)
time 5e-06% fanout 0 input 100 rows
Select (.S_NAME, .S_ADDRESS)
}
1777 msec 1355% cpu, 894483 rnd 6.39422e+08 seq 79.1214% same seg 19.3093% same pg
1.8s is sufficient, and in the ballpark with VectorWise. Some further gain is possible, as the lineitem hash table can also be restricted by supplier; after all, only 1/25 of all suppliers are in the end considered. Further simplifications are possible. Another 20% of time could be saved. The tricks are however quite complex and specific, and there are easier gains to be had -- for example, in reusing intermediates in Q17 and Q15.
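As a rough sketch of that further restriction (hypothetical; the plan above does not do this), the lineitem hash filler would also import the supplier and nation joins, so that only 1994 lineitems of Canadian suppliers of forest parts are hashed:
SELECT l_partkey, l_suppkey, l_quantity
FROM lineitem, part, supplier, nation
WHERE l_partkey = p_partkey
  AND p_name LIKE 'forest%'
  AND l_suppkey = s_suppkey
  AND s_nationkey = n_nationkey
  AND n_name = 'CANADA'
  AND l_shipdate >= CAST ('1994-01-01' AS DATE)
  AND l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
;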
The next installment will discuss late projection and some miscellaneous tricks not mentioned so far. After this, we are ready to take an initial look at the performance of the system as a whole.
To be continued...
In Hoc Signo Vinces (TPC-H) Series