In Hoc Signo Vinces (part 11 of n): TPC-H Q2, Q10 - Late Projection [ Virtuoso Data Space Bot ]

Analytics is generally about making something small out of something large. This reduction is obtained by a TOP k operator (i.e., show only the 10 best by some metric) and/or by grouping and aggregation (i.e., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).

In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned about attributes on which there are no conditions, except for fetching them at the last moment, only for the entities which in fact will be returned to the user.

We look at TPC-H Q2 and Q10.

Q2:

  SELECT  TOP 100
                   s_acctbal,
                   s_name,
                   n_name,
                   p_partkey,
                   p_mfgr,
                   s_address,
                   s_phone,
                   s_comment
    FROM  part,
          supplier,
          partsupp,
          nation,
          region
   WHERE  p_partkey = ps_partkey
     AND  s_suppkey = ps_suppkey
     AND  p_size = 15
     AND  p_type LIKE '%BRASS'
     AND  s_nationkey = n_nationkey
     AND  n_regionkey = r_regionkey
     AND  r_name = 'EUROPE'
     AND  ps_supplycost = 
            ( SELECT  MIN(ps_supplycost)
                FROM  partsupp,
                      supplier,
                      nation,
                      region
               WHERE  p_partkey = ps_partkey
                 AND  s_suppkey = ps_suppkey
                 AND  s_nationkey = n_nationkey
                 AND  n_regionkey = r_regionkey
                 AND  r_name = 'EUROPE'
            )
ORDER BY  s_acctbal DESC,
          n_name,
          s_name,
          p_partkey

The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.

Q10:

  SELECT  TOP 20
                                                     c_custkey,
                                                     c_name,
          SUM(l_extendedprice * (1 - l_discount)) AS revenue,
                                                     c_acctbal, 
                                                     n_name,
                                                     c_address,
                                                     c_phone,
                                                     c_comment
    FROM  customer,
          orders,
          lineitem,
          nation
   WHERE  c_custkey = o_custkey
     AND  l_orderkey = o_orderkey
     AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
     AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
     AND  l_returnflag = 'R'
     AND  c_nationkey = n_nationkey
GROUP BY  c_custkey,
          c_name,
          c_acctbal,
          c_phone,
          n_name,
          c_address,
          c_comment
ORDER BY  revenue DESC

The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.

We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, and in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article does apply to Q2 but not to Q10.

The profile for Q2 follows:

{ 
time   6.1e-05% fanout         1 input         1 rows
time       1.1% fanout         1 input         1 rows
{ hash filler
Subquery 27 
{ 
time    0.0012% fanout         1 input         1 rows
REGION         1 rows(t10.R_REGIONKEY)
 R_NAME = <c EUROPE>
time   0.00045% fanout         5 input         1 rows
NATION         5 rows(t9.N_NATIONKEY)
 N_REGIONKEY = t10.R_REGIONKEY
time       1.6% fanout     40107 input         5 rows
SUPPLIER   4.2e+04 rows(t8.S_SUPPKEY)
 S_NATIONKEY = t9.N_NATIONKEY
 
After code:
      0: t8.S_SUPPKEY :=  := artm t8.S_SUPPKEY
      4: BReturn 0
time       0.1% fanout         0 input    200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time    0.0004% fanout         1 input         1 rows
{ fork
time        21% fanout     79591 input         1 rows
PART     8e+04 rows(.P_PARTKEY)
 P_TYPE LIKE <c %BRASS> LIKE <c > ,  P_SIZE =  15 
time        44% fanout  0.591889 input     79591 rows
 
Precode:
      0: { 
time     0.083% fanout         1 input     79591 rows
time      0.13% fanout         1 input     79591 rows
{ fork
time        24% fanout  0.801912 input     79591 rows
PARTSUPP       3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
 inlined  PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card       0.2 -> ()
time       1.3% fanout         0 input     63825 rows
Hash source 49 merged into ts not partitionable       0.2 rows(.PS_SUPPKEY) -> ()
 
After code:
      0:  min min.PS_SUPPLYCOSTset no set_ctr
      5: BReturn 0
}
 
After code:
      0: aggregate :=  := artm min
      4: BReturn 0
time      0.19% fanout         0 input     79591 rows
Subquery Select(aggregate)
}
 
      8: BReturn 0
PARTSUPP     5e-08 rows(.PS_SUPPKEY)
 inlined  PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time       5.9% fanout  0.247023 input     47109 rows
SUPPLIER unq       0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
 inlined  S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time     0.077% fanout         1 input     11637 rows
NATION unq         1 rows (.N_REGIONKEY, .N_NAME)
 inlined  N_NATIONKEY = .S_NATIONKEY
time     0.051% fanout         1 input     11637 rows
REGION unq       0.2 rows ()
 inlined  R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time      0.42% fanout         0 input     11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
 
}
time    0.0016% fanout       100 input         1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time      0.02% fanout         1 input       100 rows
PART unq      0.95 rows (.P_MFGR)
 inlined  P_PARTKEY = .P_PARTKEY
time     0.054% fanout         1 input       100 rows
SUPPLIER unq         1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
 inlined  S_SUPPKEY = k_.S_SUPPKEY
time   6.7e-05% fanout         0 input       100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}


 128 msec 1007% cpu,    196992 rnd 2.53367e+07 seq   50.4135% same seg   45.3574% same pg 

The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have been previously put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum price partsupp has already been seen. The gain would not be very large, though.

Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to make a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain. It would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal, hence as soon as there are 100 suppliers found, one can add a restriction on the s_acctbal for subsequent ones.

At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup for getting only the columns that are returned. We note that this is done on 100 rows, whereas the previous operations are done on tens of thousands of rows. The TOP k restriction produces some benefit, but it comes relatively late in the plan, and not many operations follow it.
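
Written out as SQL, this late projection is roughly equivalent to the rewrite below. This is only a sketch of what the engine does internally: the aliases r, p, and s are ours, the inner derived table is Q2 itself minus the wide columns (plus s_suppkey for the later lookups), and we assume the dialect accepts TOP inside a derived table.

  SELECT  r.s_acctbal,
          r.s_name,
          r.n_name,
          r.p_partkey,
          p.p_mfgr,
          s.s_address,
          s.s_phone,
          s.s_comment
    FROM  ( SELECT  TOP 100
                    s_acctbal, s_name, n_name, p_partkey, s_suppkey
              FROM  part, supplier, partsupp, nation, region
             WHERE  p_partkey = ps_partkey
               AND  s_suppkey = ps_suppkey
               AND  p_size = 15
               AND  p_type LIKE '%BRASS'
               AND  s_nationkey = n_nationkey
               AND  n_regionkey = r_regionkey
               AND  r_name = 'EUROPE'
               AND  ps_supplycost =
                      ( SELECT  MIN(ps_supplycost)
                          FROM  partsupp, supplier, nation, region
                         WHERE  p_partkey = ps_partkey
                           AND  s_suppkey = ps_suppkey
                           AND  s_nationkey = n_nationkey
                           AND  n_regionkey = r_regionkey
                           AND  r_name = 'EUROPE'
                      )
          ORDER BY  s_acctbal DESC, n_name, s_name, p_partkey
          ) r,
          part p,
          supplier s
   WHERE  p.p_partkey = r.p_partkey
     AND  s.s_suppkey = r.s_suppkey
ORDER BY  r.s_acctbal DESC, r.n_name, r.s_name, r.p_partkey

The wide columns p_mfgr, s_address, s_phone, and s_comment are then touched for only the 100 surviving rows, which is exactly what the last operators of the profile above do.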

The plan is easily good enough, with only small space for improvement. Q2 is one of the fastest queries of the set.

Let us now consider the execution of Q10:

{ 
time   1.1e-06% fanout         1 input         1 rows
time   4.4e-05% fanout         1 input         1 rows
{ hash filler
time   1.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, .N_NAME)
 
time   6.7e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
 
}
time   1.5e-06% fanout         1 input         1 rows
{ fork
time   2.4e-06% fanout         1 input         1 rows
{ fork
time        13% fanout 5.73038e+06 input         1 rows
ORDERS   5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
 O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time       4.8% fanout   2.00042 input 5.73038e+06 rows
LINEITEM       1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
 inlined  L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time        25% fanout         1 input 1.14632e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
CUSTOMER unq         1 rows (.C_NATIONKEY, .C_CUSTKEY)
 inlined  C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card         1 -> (.N_NAME)
time    0.0023% fanout         1 input 1.14632e+07 rows
Hash source 35 merged into ts          1 rows(.C_NATIONKEY) -> (.N_NAME)
time       2.3% fanout         1 input 1.14632e+07 rows
Stage 2
time       3.6% fanout         0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
 
}
time       0.6% fanout 3.88422e+06 input         1 rows
group by read node  
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time      0.57% fanout         0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
 
}
time   6.9e-06% fanout        20 input         1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time   0.00036% fanout         1 input        20 rows
CUSTOMER unq         1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
 inlined  C_CUSTKEY = .C_CUSTKEY
time   1.1e-06% fanout         0 input        20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}


 2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq   76.3221% same seg   21.1204% same pg 

The plan is by index, except for the lookup of the nation name for the customer. The most selective condition is on order date, followed by the l_returnflag condition on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit. See the input cardinality above the first customer entry in the plan -- over 10M rows. The key point here is that only the c_custkey and c_nationkey get fetched, which saves a lot of time. In fact, the c_custkey is not even needed, since it is always equal to the o_custkey, but this makes little difference.

One could argue that customer should be between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between if the former are partitioned on the same key.

The c_nationkey is next translated into an n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many distinct c_custkey values (15M at 100G scale).

The most important trick is fetching all the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is only executed on 20 rows.
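
As an explicit SQL sketch of the same idea (again assuming TOP is accepted in a derived table; the aliases r and c are ours), the effect is roughly:

  SELECT  c.c_custkey,
          c.c_name,
          r.revenue,
          c.c_acctbal,
          r.n_name,
          c.c_address,
          c.c_phone,
          c.c_comment
    FROM  ( SELECT  TOP 20
                    c_custkey,
                    n_name,
                    SUM(l_extendedprice * (1 - l_discount)) AS revenue
              FROM  customer, orders, lineitem, nation
             WHERE  c_custkey = o_custkey
               AND  l_orderkey = o_orderkey
               AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
               AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
               AND  l_returnflag = 'R'
               AND  c_nationkey = n_nationkey
          GROUP BY  c_custkey, n_name
          ORDER BY  revenue DESC
          ) r,
          customer c
   WHERE  c.c_custkey = r.c_custkey
ORDER BY  r.revenue DESC

Grouping on c_custkey and n_name alone is enough because the remaining customer columns are functionally dependent on c_custkey and are fetched in the outer join.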

Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.

There is another approach to the challenge of this query: If foreign keys are declared and enforced, the system will know that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys, thus the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but does save the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).
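
Under the declared-foreign-key assumption, the aggregation could even be written without touching customer at all. The following is hypothetical for this implementation, since the keys are not declared here; the aliases r, c, and n are ours:

  -- valid only if every o_custkey is guaranteed to reference an existing customer
  SELECT  c.c_custkey,
          c.c_name,
          r.revenue,
          c.c_acctbal,
          n.n_name,
          c.c_address,
          c.c_phone,
          c.c_comment
    FROM  ( SELECT  TOP 20
                    o_custkey,
                    SUM(l_extendedprice * (1 - l_discount)) AS revenue
              FROM  orders, lineitem
             WHERE  l_orderkey = o_orderkey
               AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
               AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
               AND  l_returnflag = 'R'
          GROUP BY  o_custkey
          ORDER BY  revenue DESC
          ) r,
          customer c,
          nation n
   WHERE  c.c_custkey = r.o_custkey
     AND  c.c_nationkey = n.n_nationkey
ORDER BY  r.revenue DESC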

The late projection trick is generally useful, since almost all applications aside from bulk data export have some sort of limit on result set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the other ones. A row store can also benefit, in the form of decreased intermediate result size. This is especially good when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be evaluated via a special text index rather than a scan.

*           *           *           *           *

In the next installment we will have a look at the overall 100G single-server performance. After this we will recap the tricks so far. Then it will be time to look at the implications of scale-out for performance and to run at larger scales. After the relational ground has been covered, we can look at the implications of schema-lastness, i.e., triples, for this type of workload.

So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/07/2014 12:30 GMT
OpenPHACTS in Vienna [ Virtuoso Data Space Bot ]

Hugh Williams and I (Orri Erling) went to the Open PHACTS Steering Committee meeting in Vienna last week. I am a great fan of Open PHACTS; the meetings are fun, with a great team spirit, and there is always something new to learn.

Paul Groth gave a talk about the stellar success of the initial term of Open PHACTS.

  • Three releases of platform and data
  • 18 applications
  • Open PHACTS Foundation for sustainable exploitation and further development of the platform
  • Superb culture of collaboration
    • great team spirit
    • great output from distributed organization
    • lots of face-to-face time
    • example to every other big collaborative project

"The reincarnation of Steve Jobs," commented someone from the audience. "Except I am a nice guy," retorted Paul.

Commented one attendee, "The semantic web…., I just was in Boston at a semantic web meeting – so nerdy, something to make you walk out of the room… so it is a definite victory for Open PHACTS and why not also semantic web, that something based on these principles actually works."

It is a win anyhow, so I did not say anything at the meeting. So I will say something here, where I have more space as the message bears repeating.

We share part of the perception, so we hardly ever say "semantic web." The word is "linked data," and it means flexible schema and global identifiers. Flexible schema means that everything does not have to be modeled upfront. Global identifiers means that data, when transferred out of its silo of origin, remains interpretable and self-describing, so you can mix it with other data without things getting confused. "Desiloization" is a wonderful new word for describing this.

This ties right into FAIRport and FAIR data: Findable, Accessible, Interoperable, Reusable. Barend Mons talked a lot about this: open just means downloadable; fair means something you can do science with. Barend’s take is that RDF with a URI for everything is the super wire format for exchanging data. When you process it, you will diversely cook it, so an RDF store is one destination but not the only possibility. It has been said before: there is a range of choices between storing triples verbatim, and making application specific extractions, including ones with a schema, whether graph DB or relational.

Nanopublications are also moving ahead. Christine Chichester told me about pending publications involving Open PHACTS nanopublications about post-translational modification of proteins and their expression in different tissues. So there are nanopublications out there, and they can be joined, just as intended. A victory for e-science and data integration.

The Open PHACTS project is now officially extended for another two-year term, bringing the total duration to five years. The Open PHACTS Foundation exists as a legal entity and has its first members. This is meant to be a non-profit industry association for sharing pre-competitive data, and services built around such data, between players in the pharma space, in industry as well as academia. There are press releases to follow in due time.

I am looking forward to more Open PHACTS. From the OpenLink and Virtuoso side, there are directly relevant developments that will enter production in the next few months, including query caching discussed earlier on this blog, as well as running on the TPC-H tuned analytics branch for overall better query optimization. Adaptive schema is something of evident value to Open PHACTS, as much of the integrated data comes from relational sources, so is regular enough. Therefore taking advantage of this for storage cannot hurt. We will see this still within the scope of the project extension.

Otherwise, more cooperation in formulating the queries for the business questions will also help.

All in all, Open PHACTS is the celebrated beauty queen of the whole Innovative Medicines Initiative, it would seem. Superbly connected, an unparalleled logo cloud, actually working and useful data integration, delivering on time on what are in fact very complex business questions.

03/31/2014 11:49 GMT
In Hoc Signo Vinces (part 10 of n): TPC-H Q9, Q17, Q20 - Predicate Games [ Virtuoso Data Space Bot ]

TPC-H is a hash join game. The rules do allow indices, but maintaining these takes time, and indices will quickly result in non-local access patterns. Indices also take space. Besides, somebody must know which indices to create, which is not obvious. Thus, it is best if a BI data warehouse works without them.

Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So, the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense since only green ones are being looked for.

So, let's consider Q9. The query is:

SELECT                 nation,
                       o_year,
        SUM(amount) AS sum_profit
 FROM  ( SELECT
                                                                          n_name AS nation,
                                               EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
                 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
           FROM
                 part,
                 supplier,
                 lineitem,
                 partsupp,
                 orders,
                 nation
          WHERE  s_suppkey = l_suppkey
            AND  ps_suppkey = l_suppkey
            AND  ps_partkey = l_partkey
            AND  p_partkey = l_partkey
            AND  o_orderkey = l_orderkey
            AND  s_nationkey = n_nationkey
            AND  p_name like '%green%'
       ) AS profit
GROUP BY  nation,
          o_year
ORDER BY  nation,
          o_year DESC
;

The intent is to calculate profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest running of the queries. The query is restricted by part only, and the condition selects 1/17 of all parts.

The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect to do frequent lookups, thus making a hash is faster than using the index. Partsupp is the 3rd largest table in the database. This has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows where the part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey, but rather the system must understand that this follows from l_partkey = ps_partkey and l_partkey = p_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.
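
As a sketch, the build side of that hash join corresponds to the derived table below, which is what the hash filler subquery in the profile computes:

  -- partsupp hash build, restricted to the qualifying (green) parts
  SELECT  ps_suppkey,
          ps_partkey,
          ps_supplycost,
          p_partkey
    FROM  partsupp,
          part
   WHERE  ps_partkey = p_partkey
     AND  p_name LIKE '%green%'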

Looking further into the plan, we note a scan of lineitem followed by a hash join with part. The restriction already built into the partsupp hash would have the same effect, so part is in effect used twice in the plan while it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with the two-part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part join first is faster, since it becomes a single-key, invisible hash join, merged into the lineitem scan and done before even accessing l_suppkey and the other columns.

{ 
time   3.9e-06% fanout         1 input         1 rows
time   4.7e-05% fanout         1 input         1 rows
{ hash filler
time   3.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, nation)
 
time   8.8e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
 
}
time      0.16% fanout         1 input         1 rows
{ hash filler
time     0.011% fanout     1e+06 input         1 rows
SUPPLIER     1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
 
time      0.03% fanout         0 input     1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
 
}
time      0.57% fanout         1 input         1 rows
{ hash filler
Subquery 58 
{ 
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(t1.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time       1.1% fanout         4 input 1.17076e+06 rows
PARTSUPP       3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
 inlined  PS_PARTKEY = t1.P_PARTKEY
 
After code:
      0: t4.PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: t4.PS_PARTKEY :=  := artm t4.PS_PARTKEY
      8: t1.P_PARTKEY :=  := artm t1.P_PARTKEY
      12: t4.PS_SUPPLYCOST :=  := artm t4.PS_SUPPLYCOST
      16: BReturn 0
time      0.33% fanout         0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
 
}
}
time      0.18% fanout         1 input         1 rows
{ hash filler
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time     0.017% fanout         0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time   5.1e-06% fanout         1 input         1 rows
{ fork
time   4.1e-06% fanout         1 input         1 rows
{ fork
time        59% fanout 3.51125e+07 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
 
hash partition+bloom by 108 (tmp)hash join merged always card     0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card         1 -> (.S_NATIONKEY)
time      0.18% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
Hash source 101 merged into ts      0.058 rows(.L_PARTKEY) -> ()
time        17% fanout         1 input 3.51125e+07 rows
Hash source 82       0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> (  <none> , .PS_SUPPLYCOST)
time       6.2% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
      4: temp := artm temp - temp
      8: BReturn 0
ORDERS unq         1 rows (.O_ORDERDATE)
 inlined  O_ORDERKEY = k_.L_ORDERKEY
time    0.0055% fanout         1 input 3.51125e+07 rows
Hash source 49 merged into ts          1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time       3.5% fanout         1 input 3.51125e+07 rows
Hash source 35           1 rows(k_.S_NATIONKEY) -> (nation)
time       8.8% fanout         0 input 3.51125e+07 rows
 
Precode:
      0: o_year := Call year (.O_ORDERDATE)
      5: BReturn 0
Sort (nation, o_year) -> (temp)
 
}
time   4.7e-05% fanout       175 input         1 rows
group by read node  
(nation, o_year, sum_profit)
time   0.00028% fanout         0 input       175 rows
Sort (nation, o_year) -> (sum_profit)
 
}
time   2.2e-05% fanout       175 input         1 rows
Key from temp (nation, o_year, sum_profit)
 
time   1.6e-06% fanout         0 input       175 rows
Select (nation, o_year, sum_profit)
}


 6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq   99.6068% same seg  0.357328% same pg 

6.1s is a good score for this query. When executing the same in 5 parallel invocations, the fastest ends in 13.7s and the slowest in 27.6s. For five concurrent executions, the peak transient memory utilization is 4.7 GB for the hash tables, which is very reasonable.

*           *           *           *           *

Let us next consider Q17.

SELECT
        SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
        lineitem,
        part
WHERE
        p_partkey = l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l_quantity 
           < (
                   SELECT
                            2e-1 * AVG(l_quantity)
                   FROM
                            lineitem
                   WHERE
                            l_partkey = p_partkey
                )

Deceptively simple? This calculates the total value of small orders (below 1/5 of average quantity for the part) for all parts of a given brand with a specific container.

If there is an index on l_partkey, the plan is easy enough: Take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.

If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and this would take so long the race would be automatically lost on this point alone. The trick is to import the restriction on l_partkey into the hash build. This gives us a plan that does a scan of lineitem twice, doing a very selective hash join (few parts). There is a lookup for the average for each lineitem with the part. The average is calculated potentially several times.
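
As a sketch, the restricted build side corresponds to:

  -- hash build keyed on l_partkey, restricted to the qualifying parts
  SELECT  l_partkey,
          l_quantity
    FROM  lineitem,
          part
   WHERE  l_partkey = p_partkey
     AND  p_brand = 'Brand#23'
     AND  p_container = 'MED BOX'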

The plan below is workable, but better is possible: we notice that the very selective join needs to be done only once; it is cheaper to remember its result than to compute it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as

SELECT
        ... 
  FROM
        lineitem, 
        part, 
        ( SELECT 
                                           l_partkey, 
                 0.2 * AVG (l_quantity) AS qty 
            FROM 
                 lineitem, 
                 part 
            ...
        ) f 
 WHERE
        l_partkey = f.l_partkey 
 ...

In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
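
Filled in completely, the decorrelated form could look like the following sketch; the aliases l and f and the explicit GROUP BY on l_partkey are ours:

SELECT
        SUM(l.l_extendedprice) / 7.0 AS avg_yearly
  FROM
        lineitem l,
        part,
        ( SELECT
                  l_partkey,
                  0.2 * AVG (l_quantity) AS qty
            FROM
                  lineitem,
                  part
           WHERE
                  l_partkey = p_partkey
             AND  p_brand = 'Brand#23'
             AND  p_container = 'MED BOX'
        GROUP BY  l_partkey
        ) f
 WHERE
        p_partkey = l.l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l.l_partkey = f.l_partkey
   AND  l.l_quantity < f.qty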

{ 
time   7.9e-06% fanout         1 input         1 rows
time    0.0031% fanout         1 input         1 rows
{ hash filler
time      0.27% fanout     20031 input         1 rows
PART     2e+04 rows(.P_PARTKEY)
 P_BRAND =  <c Brand#23> ,  P_CONTAINER =  <c MED BOX>
time   0.00047% fanout         0 input     20031 rows
Sort hf 34 (.P_PARTKEY)
}
time       0.1% fanout         1 input         1 rows
{ hash filler
Subquery 40 
{ 
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
 
hash partition+bloom by 38 (tmp)hash join merged always card     0.001 -> ()
time    0.0042% fanout         1 input    600982 rows
Hash source 34 merged into ts not partitionable     0.001 rows(t4.L_PARTKEY) -> ()
 
After code:
      0: t4.L_PARTKEY :=  := artm t4.L_PARTKEY
      4: t4.L_QUANTITY :=  := artm t4.L_QUANTITY
      8: BReturn 0
time     0.059% fanout         0 input    600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
 
}
}
time   6.8e-05% fanout         1 input         1 rows
{ fork
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
 
hash partition+bloom by 38 (tmp)hash join merged always card   0.00052 -> ()
time   0.00021% fanout         1 input    600982 rows
Hash source 34 merged into ts    0.00052 rows(.L_PARTKEY) -> ()
 
Precode:
      0: .P_PARTKEY :=  := artm .L_PARTKEY
      4: BReturn 0
END Node
After test:
      0: { 
time     0.038% fanout         1 input    600982 rows
time      0.17% fanout         1 input    600982 rows
{ fork
time       6.8% fanout         0 input    600982 rows
Hash source 62  not partitionable      0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
 
After code:
      0:  sum sum.L_QUANTITYset no set_ctr
      5:  sum count 1 set no set_ctr
      10: BReturn 0
}
 
After code:
      0: temp := artm sum / count
      4: temp := artm  0.2  * temp
      8: aggregate :=  := artm temp
      12: BReturn 0
time     0.042% fanout         0 input    600982 rows
Subquery Select(aggregate)
}
 
      8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
 
After code:
      0:  sum sum.L_EXTENDEDPRICE
      5: BReturn 0
}
 
After code:
      0: avg_yearly := artm sum /  7 
      4: BReturn 0
time   4.6e-06% fanout         0 input         1 rows
Select (avg_yearly)
}


 2695 msec 1996% cpu,         3 rnd 1.18242e+09 seq         0% same seg         0% same pg 

2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.

*           *           *           *           *

Q20 is not very long-running, but it is maybe the hardest to optimize of the lot. But as usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.

SELECT TOP 100
           s_name,
           s_address
     FROM
           supplier,
           nation
    WHERE
           s_suppkey IN 
             ( SELECT  
                       ps_suppkey
                 FROM  
                       partsupp
                WHERE  
                       ps_partkey IN 
                         ( SELECT  
                                   p_partkey
                             FROM  
                                   part
                            WHERE  
                                   p_name LIKE 'forest%'
                         )
                  AND  ps_availqty > 
                         ( SELECT  
                                   0.5 * SUM(l_quantity)
                             FROM  
                                   lineitem
                            WHERE  
                                   l_partkey = ps_partkey
                              AND  l_suppkey = ps_suppkey
                              AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                              AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                         )
             )
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name

This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.

The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS which in turn can become a join as follows:

SELECT 
        l_suppkey 
FROM
        lineitem 
WHERE
        l_partkey IN  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_name LIKE 'forest%'
       )
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem 
 WHERE  EXISTS  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_partkey = l_partkey 
               AND  p_name LIKE 'forest%')
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        ( SELECT  
                    DISTINCT p_partkey 
            FROM  
                    part 
           WHERE  
                    p_name LIKE 'forest%') f 
 WHERE  
        l_partkey = f.p_partkey
;

But since p_partkey is unique, the DISTINCT drops off and we have:

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        part 
 WHERE  
        p_name LIKE 'forest%'
   AND  l_partkey = p_partkey
;

You see, the innermost IN with the ps_partkey goes through all these changes, and just becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.
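
Spelled out, the flattened form of Q20 is then roughly as below; the alias pk is ours, and this is a sketch of the shape the optimizer works with rather than the exact internal representation:

SELECT TOP 100
           s_name,
           s_address
     FROM
           supplier,
           nation,
           ( SELECT DISTINCT
                     ps_suppkey
               FROM
                     partsupp,
                     part
              WHERE
                     ps_partkey = p_partkey
                AND  p_name LIKE 'forest%'
                AND  ps_availqty >
                       ( SELECT
                                 0.5 * SUM(l_quantity)
                           FROM
                                 lineitem
                          WHERE
                                 l_partkey = ps_partkey
                            AND  l_suppkey = ps_suppkey
                            AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                            AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                       )
           ) pk
    WHERE
           s_suppkey = pk.ps_suppkey
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name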

In Hoc Signo Vinces (part 10 of n): TPC-H Q9, Q17, Q20 - Predicate Games [ Orri Erling ]

TPC-H is a hash join game. The rules do allow indices, but maintaining these takes time, and indices will quickly result in non-local access patterns. Indices also take space. Besides, somebody must know what indices to create, which is not obvious. Thus, it is best if a BI data warehouse works without them.

Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So, the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense since only green ones are being looked for.
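
As a hedged illustration of the idea (the green-parts example above written out by hand; this is not one of the benchmark queries), restricting the build side amounts to hashing a derived table instead of the whole table:

-- Hash only the lineitems of green parts, not all of lineitem.
SELECT  SUM(lg.l_extendedprice)
  FROM  orders,
        ( SELECT  l_orderkey,
                  l_extendedprice
            FROM  lineitem,
                  part
           WHERE  l_partkey = p_partkey
             AND  p_name LIKE '%green%'
        ) lg
 WHERE  o_orderkey = lg.l_orderkey

The point of the plans below is that the optimizer derives such restricted build sides on its own, without the query being written this way.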

So, let's consider Q9. The query is:

SELECT                 nation,
                       o_year,
        SUM(amount) AS sum_profit
 FROM  ( SELECT
                                                                          n_name AS nation,
                                               EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
                 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
           FROM
                 part,
                 supplier,
                 lineitem,
                 partsupp,
                 orders,
                 nation
          WHERE  s_suppkey = l_suppkey
            AND  ps_suppkey = l_suppkey
            AND  ps_partkey = l_partkey
            AND  p_partkey = l_partkey
            AND  o_orderkey = l_orderkey
            AND  s_nationkey = n_nationkey
            AND  p_name like '%green%'
       ) AS profit
GROUP BY  nation,
          o_year
ORDER BY  nation,
          o_year DESC
;

The intent is to calculate profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest-running of the queries. The query is restricted by part only, and the condition selects 1/17 of all parts.

The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect to do frequent lookups, thus making a hash is faster than using the index. Partsupp is the 3rd largest table in the database. This has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows where the part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey, but rather the system must understand that this follows from l_partkey = ps_partkey and l_partkey = p_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.
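
Written by hand, the build side of the partsupp hash (labeled Subquery 58 in the plan below) would look roughly like this sketch:

-- Only partsupp rows whose part matches the predicate are hashed,
-- keyed on (ps_suppkey, ps_partkey), with ps_supplycost as payload.
SELECT  ps_suppkey,
        ps_partkey,
        ps_supplycost
  FROM  partsupp,
        part
 WHERE  ps_partkey = p_partkey
   AND  p_name LIKE '%green%'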

Looking further into the plan, we note a scan of lineitem followed by a hash join with part. Restricting the build of the partsupp hash would have the same effect, hence part is here used twice while it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with a 2 part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part first is faster since this becomes a single-key, invisible hash join, merged into the lineitem scan, done before even accessing the l_suppkey and other columns.

{ 
time   3.9e-06% fanout         1 input         1 rows
time   4.7e-05% fanout         1 input         1 rows
{ hash filler
time   3.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, nation)
 
time   8.8e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
 
}
time      0.16% fanout         1 input         1 rows
{ hash filler
time     0.011% fanout     1e+06 input         1 rows
SUPPLIER     1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
 
time      0.03% fanout         0 input     1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
 
}
time      0.57% fanout         1 input         1 rows
{ hash filler
Subquery 58 
{ 
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(t1.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time       1.1% fanout         4 input 1.17076e+06 rows
PARTSUPP       3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
 inlined  PS_PARTKEY = t1.P_PARTKEY
 
After code:
      0: t4.PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: t4.PS_PARTKEY :=  := artm t4.PS_PARTKEY
      8: t1.P_PARTKEY :=  := artm t1.P_PARTKEY
      12: t4.PS_SUPPLYCOST :=  := artm t4.PS_SUPPLYCOST
      16: BReturn 0
time      0.33% fanout         0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
 
}
}
time      0.18% fanout         1 input         1 rows
{ hash filler
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time     0.017% fanout         0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time   5.1e-06% fanout         1 input         1 rows
{ fork
time   4.1e-06% fanout         1 input         1 rows
{ fork
time        59% fanout 3.51125e+07 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
 
hash partition+bloom by 108 (tmp)hash join merged always card     0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card         1 -> (.S_NATIONKEY)
time      0.18% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
Hash source 101 merged into ts      0.058 rows(.L_PARTKEY) -> ()
time        17% fanout         1 input 3.51125e+07 rows
Hash source 82       0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> (  <none> , .PS_SUPPLYCOST)
time       6.2% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
      4: temp := artm temp - temp
      8: BReturn 0
ORDERS unq         1 rows (.O_ORDERDATE)
 inlined  O_ORDERKEY = k_.L_ORDERKEY
time    0.0055% fanout         1 input 3.51125e+07 rows
Hash source 49 merged into ts          1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time       3.5% fanout         1 input 3.51125e+07 rows
Hash source 35           1 rows(k_.S_NATIONKEY) -> (nation)
time       8.8% fanout         0 input 3.51125e+07 rows
 
Precode:
      0: o_year := Call year (.O_ORDERDATE)
      5: BReturn 0
Sort (nation, o_year) -> (temp)
 
}
time   4.7e-05% fanout       175 input         1 rows
group by read node  
(nation, o_year, sum_profit)
time   0.00028% fanout         0 input       175 rows
Sort (nation, o_year) -> (sum_profit)
 
}
time   2.2e-05% fanout       175 input         1 rows
Key from temp (nation, o_year, sum_profit)
 
time   1.6e-06% fanout         0 input       175 rows
Select (nation, o_year, sum_profit)
}


 6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq   99.6068% same seg  0.357328% same pg 

6.1s is a good score for this query. When executing the same in 5 parallel invocations, the fastest ends in 13.7s and the slowest in 27.6s. For five concurrent executions, the peak transient memory utilization is 4.7 GB for the hash tables, which is very reasonable.

*           *           *           *           *

Let us next consider Q17.

SELECT
        SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
        lineitem,
        part
WHERE
        p_partkey = l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l_quantity 
           < (
                   SELECT
                            2e-1 * AVG(l_quantity)
                   FROM
                            lineitem
                   WHERE
                            l_partkey = p_partkey
                )

Deceptively simple? This calculates the total value of small orders (below 1/5 of average quantity for the part) for all parts of a given brand with a specific container.

If there is an index on l_partkey, the plan is easy enough: Take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.

If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and this would take so long that the race would be automatically lost on this point alone. The trick is to import the restriction on l_partkey into the hash build. This gives us a plan that scans lineitem twice, each time with a very selective hash join (few parts qualify). For each lineitem whose part qualifies, there is a lookup of that part's average quantity, so the average is potentially calculated several times for the same part.
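
Spelled out as SQL, the restricted build side (Subquery 40 in the plan further below) is roughly the following sketch:

-- Only lineitems of the roughly 20,000 qualifying parts go into the hash,
-- keyed on l_partkey with l_quantity as payload.
SELECT  l_partkey,
        l_quantity
  FROM  lineitem,
        part
 WHERE  l_partkey = p_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'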

The plan below is workable, but better is possible: we notice that the very selective join needs to be done just once; it is cheaper to remember its result than to do it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as

SELECT
        ... 
  FROM
        lineitem, 
        part, 
        ( SELECT 
                                           l_partkey, 
                 0.2 * AVG (l_quantity) AS qty 
            FROM 
                 lineitem, 
                 part 
            ...
        ) f 
 WHERE
        l_partkey = f.l_partkey 
 ...

In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
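
For concreteness, the elided pieces might be filled in along the following lines. This is a hedged sketch (with columns qualified to avoid ambiguity), not necessarily the exact form the optimizer produces:

-- The derived table f computes each qualifying part's average quantity once;
-- the outer query then compares each lineitem against that precomputed value.
SELECT  SUM(l_extendedprice) / 7.0 AS avg_yearly
  FROM  lineitem,
        part,
        ( SELECT  l_partkey,
                  0.2 * AVG (l_quantity) AS qty
            FROM  lineitem,
                  part
           WHERE  l_partkey = p_partkey
             AND  p_brand = 'Brand#23'
             AND  p_container = 'MED BOX'
        GROUP BY  l_partkey
        ) f
 WHERE  p_partkey = lineitem.l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  lineitem.l_partkey = f.l_partkey
   AND  lineitem.l_quantity < f.qty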

{ 
time   7.9e-06% fanout         1 input         1 rows
time    0.0031% fanout         1 input         1 rows
{ hash filler
time      0.27% fanout     20031 input         1 rows
PART     2e+04 rows(.P_PARTKEY)
 P_BRAND =  <c Brand#23> ,  P_CONTAINER =  <c MED BOX>
time   0.00047% fanout         0 input     20031 rows
Sort hf 34 (.P_PARTKEY)
}
time       0.1% fanout         1 input         1 rows
{ hash filler
Subquery 40 
{ 
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
 
hash partition+bloom by 38 (tmp)hash join merged always card     0.001 -> ()
time    0.0042% fanout         1 input    600982 rows
Hash source 34 merged into ts not partitionable     0.001 rows(t4.L_PARTKEY) -> ()
 
After code:
      0: t4.L_PARTKEY :=  := artm t4.L_PARTKEY
      4: t4.L_QUANTITY :=  := artm t4.L_QUANTITY
      8: BReturn 0
time     0.059% fanout         0 input    600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
 
}
}
time   6.8e-05% fanout         1 input         1 rows
{ fork
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
 
hash partition+bloom by 38 (tmp)hash join merged always card   0.00052 -> ()
time   0.00021% fanout         1 input    600982 rows
Hash source 34 merged into ts    0.00052 rows(.L_PARTKEY) -> ()
 
Precode:
      0: .P_PARTKEY :=  := artm .L_PARTKEY
      4: BReturn 0
END Node
After test:
      0: { 
time     0.038% fanout         1 input    600982 rows
time      0.17% fanout         1 input    600982 rows
{ fork
time       6.8% fanout         0 input    600982 rows
Hash source 62  not partitionable      0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
 
After code:
      0:  sum sum.L_QUANTITYset no set_ctr
      5:  sum count 1 set no set_ctr
      10: BReturn 0
}
 
After code:
      0: temp := artm sum / count
      4: temp := artm  0.2  * temp
      8: aggregate :=  := artm temp
      12: BReturn 0
time     0.042% fanout         0 input    600982 rows
Subquery Select(aggregate)
}
 
      8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
 
After code:
      0:  sum sum.L_EXTENDEDPRICE
      5: BReturn 0
}
 
After code:
      0: avg_yearly := artm sum /  7 
      4: BReturn 0
time   4.6e-06% fanout         0 input         1 rows
Select (avg_yearly)
}


 2695 msec 1996% cpu,         3 rnd 1.18242e+09 seq         0% same seg         0% same pg 

2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.

*           *           *           *           *

Q20 is not very long-running, but it is perhaps the hardest of the lot to optimize. As usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.

SELECT TOP 100
           s_name,
           s_address
     FROM
           supplier,
           nation
    WHERE
           s_suppkey IN 
             ( SELECT  
                       ps_suppkey
                 FROM  
                       partsupp
                WHERE  
                       ps_partkey IN 
                         ( SELECT  
                                   p_partkey
                             FROM  
                                   part
                            WHERE  
                                   p_name LIKE 'forest%'
                         )
                  AND  ps_availqty > 
                         ( SELECT  
                                   0.5 * SUM(l_quantity)
                             FROM  
                                   lineitem
                            WHERE  
                                   l_partkey = ps_partkey
                              AND  l_suppkey = ps_suppkey
                              AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                              AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                         )
             )
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name

This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.

The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS which in turn can become a join as follows:

SELECT 
        l_suppkey 
FROM
        lineitem 
WHERE
        l_partkey IN  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_name LIKE 'forest%'
       )
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem 
 WHERE  EXISTS  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_partkey = l_partkey 
               AND  p_name LIKE 'forest%')
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        ( SELECT  
                    DISTINCT p_partkey 
            FROM  
                    part 
           WHERE  
                    p_name LIKE 'forest%') f 
 WHERE  
        l_partkey = f.p_partkey
;

But since p_partkey is unique, the DISTINCT drops off, and we have:

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        part 
 WHERE  
        p_name LIKE 'forest%'
   AND  l_partkey = p_partkey
;

You see, the innermost IN with the ps_partkey goes through all these changes, and just becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.
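
Put together, the effective shape after these rewrites is roughly the following. This is a sketch for illustration; the engine derives it internally rather than requiring the query to be written this way:

-- The inner IN on ps_partkey has become a plain join with part;
-- the outer IN on ps_suppkey remains a DISTINCT derived table.
SELECT  TOP 100
        s_name,
        s_address
  FROM  supplier,
        nation,
        ( SELECT  DISTINCT ps_suppkey
            FROM  partsupp,
                  part
           WHERE  ps_partkey = p_partkey
             AND  p_name LIKE 'forest%'
             AND  ps_availqty >
                    ( SELECT  0.5 * SUM(l_quantity)
                        FROM  lineitem
                       WHERE  l_partkey = ps_partkey
                         AND  l_suppkey = ps_suppkey
                         AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                         AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                    )
        ) sk
 WHERE  s_suppkey = sk.ps_suppkey
   AND  s_nationkey = n_nationkey
   AND  n_name = 'CANADA'
 ORDER BY  s_name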

The derived table is flattened and the DISTINCT is done partitioned; hence the stage node in front of the distinct. A DISTINCT can be multithreaded, if each thread gets a specific subset of all the keys. The stage node is an exchange of tuples between several threads. Each thread then does a TOP k sort. The TOP k trick we saw in Q18 is used, but does not contribute much here.

{ 
time   8.2e-06% fanout         1 input         1 rows
time   0.00017% fanout         1 input         1 rows
{ hash filler
time   6.1e-05% fanout         1 input         1 rows
NATION         1 rows(.N_NATIONKEY)
 N_NAME =  <c CANADA>
time   1.2e-05% fanout         0 input         1 rows
Sort hf 34 (.N_NATIONKEY)
}
time     0.073% fanout         1 input         1 rows
{ hash filler
time       4.1% fanout    240672 input         1 rows
PART   2.4e+05 rows(t74.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time     0.011% fanout         0 input    240672 rows
Sort hf 47 (t74.P_PARTKEY)
}
time      0.69% fanout         1 input         1 rows
{ hash filler
Subquery 56 
{ 
time        42% fanout 1.09657e+06 input         1 rows
LINEITEM   9.1e+07 rows(t76.L_PARTKEY, t76.L_SUPPKEY, t76.L_QUANTITY)
 L_SHIPDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 54 (tmp)hash join merged always card     0.012 -> ()
time     0.022% fanout         1 input 1.09657e+06 rows
Hash source 47 merged into ts not partitionable     0.012 rows(t76.L_PARTKEY) -> ()
 
After code:
      0: t76.L_PARTKEY :=  := artm t76.L_PARTKEY
      4: t76.L_SUPPKEY :=  := artm t76.L_SUPPKEY
      8: t76.L_QUANTITY :=  := artm t76.L_QUANTITY
      12: BReturn 0
time      0.22% fanout         0 input 1.09657e+06 rows
Sort hf 80 (t76.L_PARTKEY, t76.L_SUPPKEY) -> (t76.L_QUANTITY)
 
}
}
time   2.1e-05% fanout         1 input         1 rows
time   3.2e-05% fanout         1 input         1 rows
{ fork
time       5.3% fanout    240672 input         1 rows
PART   2.4e+05 rows(t6.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time       1.9% fanout         4 input    240672 rows
PARTSUPP       1.2 rows(t4.PS_AVAILQTY, t4.PS_PARTKEY, t4.PS_SUPPKEY)
 inlined  PS_PARTKEY = t6.P_PARTKEY
time        16% fanout  0.680447 input    962688 rows
END Node
After test:
      0: { 
time      0.08% fanout         1 input    962688 rows
time       9.4% fanout         1 input    962688 rows
{ fork
time       3.6% fanout         0 input    962688 rows
Hash source 80       0.013 rows(k_t4.PS_PARTKEY, k_t4.PS_SUPPKEY) -> (t8.L_QUANTITY)
 
After code:
      0:  sum sumt8.L_QUANTITYset no set_ctr
      5: BReturn 0
}
 
After code:
      0: temp := artm  0.5  * sum
      4: aggregate :=  := artm temp
      8: BReturn 0
time      0.85% fanout         0 input    962688 rows
Subquery Select(aggregate)
}
 
      8: if (t4.PS_AVAILQTY > scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
time         1% fanout         1 input    655058 rows
Stage 2
time     0.071% fanout         1 input    655058 rows
Distinct (q_t4.PS_SUPPKEY)
 
After code:
      0: PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: BReturn 0
time     0.016% fanout         1 input    655058 rows
Subquery Select(PS_SUPPKEY)
time       3.2% fanout 0.0112845 input    655058 rows
SUPPLIER unq     0.075 rows (.S_NAME, .S_NATIONKEY, .S_ADDRESS)
 inlined  S_SUPPKEY = PS_SUPPKEY
hash partition+bloom by 38 (tmp)hash join merged always card      0.04 -> ()
top k on S_NAME
time    0.0012% fanout         1 input      7392 rows
Hash source 34 merged into ts       0.04 rows(.S_NATIONKEY) -> ()
time     0.074% fanout         0 input      7392 rows
Sort (.S_NAME) -> (.S_ADDRESS)
 
}
time   0.00013% fanout       100 input         1 rows
top order by read (.S_NAME, .S_ADDRESS)
time     5e-06% fanout         0 input       100 rows
Select (.S_NAME, .S_ADDRESS)
}


 1777 msec 1355% cpu,    894483 rnd 6.39422e+08 seq   79.1214% same seg   19.3093% same pg 

1.8s is sufficient, and in the same ballpark as VectorWise. Some further gain is possible, as the lineitem hash table can also be restricted by supplier; after all, only 1/25 of all suppliers are considered in the end. Further simplifications are possible, and another 20% of the time could be saved. The tricks are however quite complex and specific, and there are easier gains to be had -- for example, in reusing intermediates in Q17 and Q15.
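
For reference, restricting the lineitem hash by supplier as well would amount to building it from something like the following sketch, assuming the nation restriction is imported into the build the same way the part restriction was:

-- Only 1994 shipments of forest% parts by suppliers of the one nation of interest
-- would be hashed, cutting the build to roughly 1/25 of its current size.
SELECT  l_partkey,
        l_suppkey,
        l_quantity
  FROM  lineitem,
        part,
        supplier,
        nation
 WHERE  l_partkey = p_partkey
   AND  p_name LIKE 'forest%'
   AND  l_suppkey = s_suppkey
   AND  s_nationkey = n_nationkey
   AND  n_name = 'CANADA'
   AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
   AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))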

The next installment will discuss late projection and some miscellaneous tricks not mentioned so far. After this, we are ready to take an initial look at the performance of the system as a whole.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

# PermaLink Comments [0]
03/20/2014 16:01 GMT Modified: 04/07/2014 12:34 GMT
Linked Geospatial Data 2014 Workshop, Part 4: GeoKnow, London, Brussels, The Message [ Virtuso Data Space Bot ]

Last Friday (2014-03-14) I gave a talk about GeoKnow at the EC Copernicus Big Data workshop. This was a trial run for more streamlined messaging. I have, aside from the practice of geekcraft, occupied myself with questions of communication these last weeks.

The clear take-home from London and Brussels alike is that these events have full days and 4 or more talks an hour. It is not quite TV commercial spots yet, but it is going in that direction.

If you say something complex, little will get across unless the audience already knows what you will be saying.

I had a set of slides from Jens Lehmann, the GeoKnow project coordinator, for whom I was standing in. Now, these are a fine rendition of the description of work. What is wrong with partners, work packages, objectives, etc.? Nothing, except that everybody has them.

I recall the old story about the journalist and the Zen master: The Zen master repeatedly advises the reporter to cut the story in half. We get the same from PR professionals, "If it is short, they have at least thought about what should go in there," said one recently, talking of pitches and messages. The other advice was to use pictures. And to have a personal dimension to it.

Enter "Ms. Globe" and "Mr. Cube". Frans Knibbe of Geodan gave the Linked Geospatial Data 2014 workshop's most memorable talk entitled "Linked Data and Geoinformatics - a love story" (pdf) about the excitement and the pitfalls of the burgeoning courtship of Ms. Globe (geoinformatics) and Mr. Cube (semantic technology). They get to talking, later Ms. Globe thinks to herself... "Desiloisazation, explicit semantics, integrated metadata..." Mr. Cube, young upstart now approaching a more experienced and sophisticated lady, dreams of finally making an entry into adult society, "critical mass, global scope, relevant applications..." There is a vibration in the air.

So, with Frans Knibbe's gracious permission I borrowed the storyline and some of the pictures.

We ought to make a series of cartoons about the couple. There will be twists and turns in the story to come.

Mr. Cube is not Ms. Globe's first lover, though; there is also rich and worldly Mr. Table. How will Mr. Cube prove himself? The eternal question... Well, not by moping around, not by wise-cracking about semantics, no. By boldly setting out upon a journey to fetch the Golden Fleece from beyond the crashing rocks. "Column store, vectored execution, scale out, data clustering, adaptive schema..." he affirms, with growing confidence.

This is where the story stands right now. Virtuoso runs circles around PostGIS doing aggregations and lookups on geometries in a map-scrolling scenario (GeoKnow's GeoBenchLab). Virtuoso SPARQL outperforms PostGIS SQL against planet-scale OpenStreetMap; Virtuoso SQL goes 5-10x faster still.

Mr. Cube is fast on the draw, but some corners can still be smoothed out.

Later in GeoKnow, there will be still more speed, but also near parity between SQL and SPARQL, by taking advantage of data regularity to guide physical storage. If the data is big, it is bound to have repeating structure.

The love story grows more real by the day. To be consummated still within GeoKnow.

Talking of databases has the great advantage that this has been a performance game from the start. There are few people who need convincing about the desirability of performance, as this also makes for lower cost and more flexibility on the application side.

But this is not all there is to it.

In Brussels, the audience was from E-science (Earth observation). In science, it is understood that qualitative aspects can be even more crucial. I told the story of an E-science-oriented workshop I attended in America years ago. The practitioners, from high energy physics to life sciences to climate, had invariably come across the need for self-description of data and for schema-last. These needs were essentially never met with RDF, except in some life science cases. Rather, we had one-off schemes, ranging from key-value pairs to putting the table name in a column of the same table to preserve the origin across data export.

Explicit semantics and integrated metadata are important, Ms. Globe knows, but she cannot sacrifice operational capacity for this. So it is more than a DBMS or even data model choice -- there must be a solid tool chain for data integration and visualization. GeoKnow provides many tools in this space.

Some of these, such as the LIMES entity matching framework (pdf) are probably close to the best there is. For other parts, the SQL-based products with hundreds of person years invested in user interaction are simply unbeatable.

In these cases, the world can continue to talk SQL. If the regular part of the data is in fact tables already, so much the better. You connect to Virtuoso via SQL, just like to PostGIS or Oracle Spatial, and talk SQL MM. The triples, in the sense of flexible annotation and integrated metadata, stay there; you just do not see them if you do not want them.

There are possibilities all right. In the coming months I will showcase some of the progress, starting with a detailed look at the OpenStreetMap experiments we have made in GeoKnow.

Linked Geospatial Data 2014 Workshop posts:

# PermaLink Comments [0]
03/18/2014 10:45 GMT Modified: 03/18/2014 10:52 GMT
Linked Geospatial Data 2014 Workshop, Part 3: The Stellar Reach of OKFN [ Virtuso Data Space Bot ]

The Open Knowledge Foundation (OKFN) held a London Open Data Meetup in the evening of the first day of the Linked Geospatial Data 2014 workshop. The event was, as they themselves put it, at the amazing open-concept office of OKFN at the Center for Creative Collaboration in Central London. What could sound cooler? True, OKFN threw a good party, with the ever-engaging and charismatic founder Rufus Pollock presiding. Phil Archer noted, only half in jest, that OKFN was so influential and visible, had the ear of government and public alike, etc., that it put the W3C to shame.

Now, OKFN is a party in the LOD2 FP7 project, so I have over the years met people from there on and off. In LOD2, OKFN is praised to the skies for its visibility and influence and outreach and sometimes, in passing, critiqued for not publishing enough RDF, let alone five star linked data.

As it happens, CSV rules, and even the W3C will, it appears, undertake to standardize a CSV-to-RDF mapping. As far as I am concerned, as long as there is no alignment of identifiers or vocabulary, whether a thing is CSV or exactly equivalent RDF, there is no great difference, except that CSV is smaller and loads into Excel.

For OKFN, which has a mission of opening data, insisting on any particular format would just hinder the cause.

What do we learn from this? OKFN is praised not only for government relations but also for developer friendliness. Lobbying for open data is something I can understand, but how do you do developer relations? This is not like talking to customers, where the customer wants to do something and it is usually possible to give some kind of advice or recommendation on how they can use our technology for the purpose.

Are JSON and Mongo DB the key? A well-renowned database guy once said that to be with the times, JSON is your data model, Hadoop your file system, Mongo DB your database, and JavaScript your language; failing this, you are an old fart, a legacy suit, well, some uncool fossil.

The key is not limited to JSON. More generally, it is zero time to some result and no learning curve. Some people will sacrifice almost anything for this, such as the possibility of doing arbitrary joins. People will even write code, even lots of it, if it only happens to be in their framework of choice.

Phil again deplored the early fiasco of RDF messaging. "Triples are not so difficult. It is not true that RDF has a very steep learning curve." I would have to agree. The earlier gaffes of the RDF/XML syntax and the infamous semantic web layer cake diagram now lie buried and unlamented; let them be.

Generating user experience from data or schema is an old mirage that has never really worked out. The imagined gain from eliminating application writing has however continued to fascinate IT minds and attempts in this direction have never really ceased. The lesson of history seems to be that coding is not to be eliminated, but that it should have fast turnaround time and immediately visible results.

And since this is the age of data, databases should follow this lead. Schema-last is a good point; adding JSON alongside XML as an object type in RDF might not be so bad. There are already XML functions, so why not the analog for JSON? Just don't mention XML to the JSON folks...

How does this relate to OKFN? Well, in the first instance this is the cultural impression I received from the meetup, but in a broader sense these factors are critical to realizing the full potential of OKFN's successes so far. OKFN is a data opening advocacy group; it is not a domain-specific think tank or special interest group. The data owners and their consultants will do analytics and even data integration if they see enough benefit in this, all in the established ways. However, the widespread opening of data does create possibilities that did not exist before. Actual benefits depend in great part on constant lowering of access barriers, and on a commitment by publishers to keep the data up to date, so that developers can build more than just a one-off mashup.

True, there are government users of open data, since there is a productivity gain in already having the neighboring department's data opened to a point: one no longer has to go through red tape to gain access to it.

For an application ecosystem to keep growing on the base of tens of thousands of very heterogeneous datasets coming into the open, continuing to lower barriers is key. This is a very different task from making faster and faster databases or of optimizing a particular business process, and it demands different thinking.

Linked Geospatial Data 2014 Workshop posts:

# PermaLink Comments [0]
03/18/2014 10:45 GMT
Linked Geospatial Data 2014 Workshop, Part 2: Is SPARQL Slow? [ Virtuso Data Space Bot ]

I had a conversation with Andy Seaborne of Epimorphics, initial founder of the Jena RDF Framework tool chain and editor of many W3C recommendations, among which the two SPARQLs. We exchanged some news; I told Andy about our progress in cutting the RDF-to-SQL performance penalty and doing more and better SQL tricks. Andy asked me if there were use cases doing analytics over RDF, not in the business intelligence sense, but in the sense of machine learning or discovery of structure. There is, in effect, such work, notably in data set summarization and description. A part of this has to do with learning the schema, like one would if wanting to put triples into tables when appropriate. CWI in LOD2 has worked in this direction, as has DERI (Giovanni Tummarello's team), in the context of giving hints to SPARQL query writers. I would also mention Chris Bizer et al., at University of Mannheim, with their data integration work, which is all about similarity detection in a schema-less world, e.g., the 150M HTML tables in the Common Crawl, briefly mentioned in the previous blog. Jens Lehmann from University of Leipzig has also done work in learning a schema from the data, this time in OWL.

Andy was later on a panel where Phil Archer asked him whether SPARQL was slow by nature or whether this was a matter of bad implementations. Andy answered approximately as follows: "If you allow for arbitrary ad hoc structure, you will always pay something for this. However, if you tell the engine what your data is like, it is no different from executing SQL." This is essentially the gist of our conversation. Most likely we will make this happen via adaptive schema for the regular part and exceptions as quads.

Later I talked with Phil about the "SPARQL is slow" meme. The fact is that Virtuoso SPARQL will outperform or match PostGIS SQL for geospatial lookups against the OpenStreetMap dataset. Virtuoso SQL will win by a factor of 5 to 10. Still, the "SPARQL is slow" meme is not entirely without a basis in fact. I would say that the really blatant cases that give SPARQL a bad name are query optimization problems. With 50 triple patterns in a query, there are 50-factorial ways of getting a bad plan. This is where the catastrophic failures of 100+ times worse than SQL come from. The regular penalty of doing triples vs. tables is somewhere between 2.5 (Star Schema Benchmark) and 10 (lookups with many literals), quite acceptable for many applications. Some really bad cases can occur with regular expressions on URI strings or literals, but then, if this is the core of the application, it should use a different data model or an n-gram index.

The solutions, including more dependable query plan choice, will flow from adaptive schema which essentially reduces RDF back into relational, however without forcing schema first and with accommodation for exceptions in the data.

Phil noted here that there already exist many (so far, proprietary) ways of describing the shape of a graph. He said there would be a W3C activity for converging these. If so, a vocabulary that can express relationships, the types of related entities, their cardinalities, etc., comes close to a SQL schema and its statistics. Such a thing can be the output of data analysis, or the input to a query optimizer or storage engine, for using a schema where one in fact exists. Like this, there is no reason why things would be less predictable than with SQL. The idea of a re-convergence of data models is definitely in the air; this is in no sense limited to us.

Linked Geospatial Data 2014 Workshop posts:

# PermaLink Comments [0]
03/18/2014 10:45 GMT
Linked Geospatial Data 2014 Workshop, Part 1: Web Services or SPARQL Modeling? [ Virtuso Data Space Bot ]

The W3C (World Wide Web Consortium) and OGC (Open Geospatial Consortium) organized the Linked Geospatial Data 2014 workshop in London this week. The GeoKnow project was represented by Claus Stadler of Universität Leipzig, and Hugh Williams and myself (Orri Erling) from OpenLink Software. The Open Knowledge Foundation (OKFN) also held an Open Data Meetup in the evening of the first day of the workshop.

Reporting on each talk and the many highly diverse topics addressed is beyond the scope of this article; for this you can go to the program and the slides that will be online. Instead, I will talk about questions that to me seemed to be in the air, and about some conversations I had with the relevant people.

The trend in events like this is towards shorter and shorter talks and more and more interaction. In this workshop, talks were given in series of three talks with all questions at the end, with all the presenters on stage. This is not a bad idea since we get a panel-like effect where many presenters can address the same question. If the subject matter allows, a panel is my preferred format.

Web services or SPARQL? Is GeoSPARQL good? Is it about Linked Data or about ontologies?

Geospatial data tends to be exposed via web services, e.g., WFS (Web Feature Service). This allows item retrieval on a lookup basis and some predefined filtering, transformation, and content negotiation. Capabilities vary; OGC now has WFS 2.0, and there are open source implementations that do a fair job of providing the functionality.

Of course, a real query language is much more expressive, but a service API is more scalable, as people say. What they mean is that an API is more predictable. For pretty much any complex data task, a query language is near-infinitely more efficient than going back-and-forth, often on a wide area network, via an API. So, as Andreas Harth put it: for data publishers, make an API; an open SPARQL endpoint is too "brave" [Andreas' word, with the meaning of foolhardy]. When you analyze, he continued, you load the data into an endpoint of your own. Any quality-of-service terms must be formulated with respect to a fixed workload; this is not meaningful with ad hoc queries in an expressive language. Things like anytime semantics (return whatever is found within a time limit) are only good for a first interactive look, not for applications.

Should the application go to the data or the reverse? Some data is big and moving it is not self-evident. A culture of datasets being hosted on a cloud may be forming. Of course some linked data like DBpedia has for a long time been available as Amazon images. Recently, SindiceTech has made a similar packaging of Freebase. The data of interest here is larger and its target audience is more specific, on the e-science side.

How should geometries be modeled? I have met GeoSPARQL, and the SQL MM on which it is based, with a sense of relief, as these are reasonable things that can be efficiently implemented. There are proposals where points have URIs, and linestrings are ordered sets of points, and collections are actual trees with RDF subjects as nodes. As a standard, such a thing is beyond horrible, as it hits all the RDF penalties and overheads full force, and promises easily 10x worse space consumption and 100x worse run times compared to the sweetly reasonable GeoSPARQL. One presenter said that cases of actually hanging attributes off points of complex geometries had been heard of but were, in his words, anecdotal. He posed a question to the audience about use cases where points in fact needed separately addressable identities. Several cases did emerge, involving, for example, different measurement certainties for different points on a trajectory trace obtained by radar. Applications that need data of this sort will perforce be very domain specific. OpenStreetMap (OSM) itself is a bit like this, but there the points that have individual identity also have predominantly non-geometry attributes and stand for actually-distinct entities. OSM being a practical project, these are then again collapsed into linestrings for cases where this is more efficient. The OGC data types themselves have up to 4 dimensions, of which the 4th could be used as an identifier of a point in the event this really were needed. If so, this would likely be empty for most points and would compress away if the data representation were done right.

For data publishing, Andreas proposed to give OGC geometries URIs, i.e., the borders of a country can be more or less precisely modeled, and the large polygon may have different versions and provenances. This is reasonable enough, as long as the geometries are big. For applications, one will then collapse the 1:n between entity and its geometry into a 1:1. In the end, when you make an application, even an RDF one, you do not just throw all the data in a bucket and write queries against that. Some alignment and transformation is generally involved.

Linked Geospatial Data 2014 Workshop posts:

# PermaLink Comments [0]
03/18/2014 10:45 GMT