LOD2 Finale (part 2 of n): The 500 Giga-triples

No epic is complete without a descent into hell. Enter the historia calamitatum of the 500 Giga-triples (Gt) at CWI's Scilens cluster.

Now, from last time, we know to generate the data without 10 GB of namespace prefixes per file and to split it into many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified; generation took about 4 days this time. Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values, and skew becomes severe if they all go to the same partition. Hence consecutive small INTs each go to a different partition, while for larger values the low 8 bits are ignored, so that runs of consecutive values land together, which is good for compression; this grouping is wanted for larger values but not for small INTs. Another uniquely brain-dead feature of the BSBM generator has also been rectified: when generating multiple files, the program would put items into files in a round-robin manner, instead of putting consecutive numbers in consecutive places, which is how every other data generator or exporter does it. This hurts bulk load locality, and as you, dear reader, ought to know by now, performance comes from (1) locality and (2) parallelism.
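
As an illustration of this partitioning rule, here is a minimal sketch; the cutoff for "small" and the partition count are invented for the example and are not the actual Virtuoso values.

#include <stdint.h>

#define N_PARTITIONS 48   /* illustrative partition count */

/* Small keys spread out one by one, so frequent values like 1 and 2 do
   not all land in the same partition.  Larger keys ignore the low 8 bits,
   so runs of consecutive values stay together, which helps compression.
   Non-negative keys are assumed. */
static inline int
partition_of (int64_t key)
{
  if (key < 256)            /* the "small" cutoff is an assumption */
    return (int) (key % N_PARTITIONS);
  return (int) ((key >> 8) % N_PARTITIONS);
}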

The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock than last time; anyway, a different set of machines.

The first experiment is with triples, so no characteristic sets, no schema.

So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with malloc, whether in small or large pieces, but it can with mmap. Ain't seen that before. One day shot. Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes. Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigE's. So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; it does not even respond to ping. We replace it with another box. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data are in. But now it takes 2h21min to make a checkpoint, i.e., make the loaded data durable on disk. One of the boxes manages to write only 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk; seen such before. The data can however be read back once the write is finally done.

Well, this is a non-starter. So, by mid-day of day 4, the machine with the bad disk has also been replaced. Now writing to disk is possible within expected delays.

In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.

In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on POSG for triples where the predicate is rdf:type and the object is a common type. Use of characteristic sets will stop this nonsense.

But let us not get ahead of the facts: At 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.

A previous scale-model experiment of loading 37.5 Gt in 192 GB of RAM, paging to a pair of 2 TB disks, had been done a week before. It finished in time, keeping a load rate of above 400 Kt/s on a 12-core box.

At 10AM on day 5 (Friday), the cluster is rebooted; a whole night's run missed. The cluster starts and takes about 30 minutes to get to its former 5 Mt/s load rate. We now try switching the network back to InfiniBand. The whole ethernet network seemed to have crashed at 9PM on day 4. This is of course unexplained but the experiment had been driving the ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will now try IB rather than risk this happening again, especially since if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).

So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not full platform since there is load unbalance as noted above.

At 9PM on day 5, the rate is around 5.7 Mt/s with the peak node at 1500% CPU out of a possible 1600%. The next one is under 800%, which is just to show what it means to index everything. Specifically, the node with the highest CPU is the one in whose partition the bsbm:offer class falls, so there is a local peak, since roughly one of every 9 triples says that something is an offer. The stupidity of the triple store is to index garbage like this to begin with. The reason why the performance is still good is that a POSG index where P and O are fixed and the S is densely ascending is very good, with everything but the S represented as run lengths and the S as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.
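
To illustrate the kind of representation meant here (a minimal sketch only, not Virtuoso's actual storage format), a dense, ascending run of subject ids can be kept as a base value plus a bitmap, costing about a bit per subject:

#include <stdint.h>
#include <stddef.h>

/* Illustrative: the subjects of one (P, O) run kept as base + bitmap. */
typedef struct
{
  int64_t   base;      /* smallest subject id in the run */
  uint64_t *bits;      /* bit n set means subject (base + n) is present */
  size_t    n_words;   /* bitmap length in 64-bit words */
} dense_subjects_t;

static int
dense_subjects_contain (const dense_subjects_t *d, int64_t s)
{
  int64_t off = s - d->base;
  if (off < 0 || (size_t) off >= d->n_words * 64)
    return 0;
  return (int) ((d->bits[off / 64] >> (off % 64)) & 1);
}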

The journey consists of 3 different parts. At 10PM, the 3rd and last part is started. The triples in this part have more literals, but the load is more even. The cruising speed is 4.3 Mt/s, down from 6.2, reflecting the different shape of the data.

The last stretch of the data is about reviews. This stretch of the data has less skew. So we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.

At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely go in under 24h. The average rate for a 4 hour sample between midnight and 4AM of day 6 is 6.8 Mt/s. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.

At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions produces a large number of sometimes-long messages, which results in many mmaps. So we will wait until the morning of day 8 (Monday) for the administrator to raise these limits. In the meantime, we analyze the behavior of the workload on the 37 Gt scale-model cluster on my desktop.

To be continued...

LOD2 Finale Series

08/18/2014 16:55 GMT-0500
LOD2 Finale (part 1 of n): RDF Before The Dawn

The LOD2 FP7 project ends at the end of August 2014. This post begins a series that will crown the project with a grand finale, another decisive step towards the project’s chief goal of giving RDF and linked data performance parity with SQL systems.

In a nutshell, LOD2 went like this:

  1. Triples were done right, taking the best of the column store world and adapting it to RDF. This is now in widespread use.

  2. SQL was done right, as I have described in detail in the TPC-H series. This is generally available as open source in v7fasttrack. SQL is the senior science and a runner-up like sem-tech will not carry the day without mastering this.

  3. RDF is now breaking free of the triple store. RDF is a very general, minimalistic way of talking about things. It is not a prescription for how to build a database. Confusing these two things has given rise to RDF’s relative cost against alternatives. To cap off LOD2, we will have the flexibility of triples with the speed of the best SQL.

In this post we will look at accomplishments so far and outline what is to follow during August. We will also look at what in fact constitutes the RDF overhead, why this is presently so, and why this does not have to stay thus.

This series will be of special interest to anybody concerned with RDF efficiency and scalability.

At the beginning of LOD2, I wrote a blog post discussing the RDF technology and its planned revolution in terms of the legend of Perseus. The classics give us exemplars and archetypes, but actual histories seldom follow them one-to-one; rather, events may have a fractal nature where subplots reproduce the overall scheme of the containing story.

So it is also with LOD2: The Promethean pattern of fetching the fire (state of the art of the column store) from the gods (the DB world) and bringing it to fuel the campfires of the primitive semantic tribes is one phase, but it is not the totality. This is successfully concluded, and Virtuoso 7 is widely used at present. Space efficiency gains are about 3x over the previous version, with performance gains anywhere from 3 to 100x. As pointed out in the Star Schema Benchmark series (part 1 and part 2), in the good case one can run circles in SPARQL around anything but the best SQL analytics databases.

In the larger scheme of things, this is just preparation. In the classical pattern, there is the call or the crisis: Presently this is that having done triples about as right as they can be done, the mediocre in SQL can be vanquished, but the best cannot. Then there is the actual preparation: Perseus talking to Athena and receiving the shield of polished brass and the winged sandals. In the present case, this is my second pilgrimage to Mount Database, consisting of the TPC-H series. Now, the incense has been burned and libations offered at each of the 22 stations. This is not reading papers, but personally making one of the best-ever implementations of this foundational workload. This establishes Virtuoso as one of the top-of-the-line SQL analytics engines. The RDF public, which is anyway the principal Virtuoso constituency today, may ask what this does for them.

Well, without this step, the LOD2 goal of performance parity with SQL would be both meaningless and unattainable. The goal of parity is worth something only if you compare the RDF contestant to the very best SQL. And the comparison cannot possibly be successful unless it incorporates the very same hard core of down-to-the-metal competence the SQL world has been pursuing now for over forty years.

It is now time to cut the Gorgon’s head. The knowledge and prerequisite conditions exist.

The epic story is mostly about principles. If it is about personal combat, the persons stand for values and principles rather than for individuals. Here the enemy is actually an illusion, an error of perception, that has kept RDF in chains all this time. Yes, RDF is defined as a data model with triples in named graphs, i.e., quads. If nothing else is said, an RDF Store is a thing that can take arbitrary triples and retrieve them with SPARQL. The naïve implementation is to store things as rows in a quad table, indexed in any number of ways. There have been other approaches suggested, such as property tables or materialized views of some joins, but these tend to throw the baby out with the bathwater: If RDF is used in the first place, it is used for its schema-less-ness and for having global identifiers. In some cases, there is also some inference, but the matter of schema-less-ness and identifiers predominates.

We need to go beyond a triple table and a dictionary of URI names while maintaining the present semantics and flexibility. Nobody said that physical structure needs to follow this. Everybody just implements things this way because this is the minimum that will in any case be required. Combining this with a SQL database for some other part of the data/workload hits basically insoluble problems of impedance mismatch between the SQL and SPARQL type systems, maybe using multiple servers for different parts of a query, etc. But if you own one of the hottest SQL racers in DB city and can make it do anything you want, most of these problems fall away.

The idea is simple: Put the de facto rectangular part of RDF data into tables; do not naively index everything in places where an index gives no benefit; keep the irregular or sparse part of the data as quads. Optimize queries according to the table-like structure, as that is where the volume is and where getting the best plan is a make or break matter, as we saw in the TPC-H series. Then, execute in a way where the details of the physical plan track the data; i.e., sometimes the operator is on a table, sometimes on triples, for the long tail of exceptions.

In the next articles we will look at how this works and what the gains are.

These experiments will for the first time showcase the adaptive schema features of the Virtuoso RDF store. Some of these features will be commercial only, but the interested will be able to reproduce the single server experiments themselves using the v7fasttrack open source preview. This will be updated around the second week of September to give a preview of this with BSBM and possibly some other datasets, e.g., Uniprot. Performance gains for regular datasets will be very large.

To be continued...

LOD2 Finale Series

08/18/2014 16:55 GMT-0500
In Hoc Signo Vinces (part 15 of n): TPC-H and the Science of Hash

This piece is dedicated to Peter Boncz, architect of Actian Vector and MonetDB.

Query optimization is hard. It is a set of mutually interacting tricks and special cases. Execution is also hard, but there the tricks do not interact quite as much or as unpredictably. So, if there are a few percent of score to be had from improving either execution or query optimization, I will take execution first. It is less likely to break things and will probably benefit a larger set of use cases.

As we see from the profile in the previous article, hash join is the main piece of execution in TPC-H. So between the article on late projection and the first result preview, I changed the hash table used in HASH JOIN and GROUP BY from cuckoo to linear.

Let's see how the hash tables work: Cuckoo hash is a scheme where an entry can be in one of two possible places in the table. If a new entry is inserted and either of the possible places is unoccupied, it goes there. If both are occupied, it could be that one contains an entry whose other possible location is free -- and then that entry may be relocated. Thus an insert may push the previous occupant of the place somewhere else, which in turn may push another, and so on. It may happen that an insert is still not possible, in which case the entry to insert goes into an exceptions list.

To look up an entry, you get a hash number, and use different fields of it to pick the two places. Look in one, then the other, then the exceptions. If there is no match and the table is reasonably close to capacity, you will have looked in at least 3 widely-separated places to determine the absence of a match. In practice, the hash table consists of a prime number of distinct arrays of a fixed size (partitions), and each partition has its own exception list. A modulo of the hash number picks the array, then two further modulos of different parts of the number pick the places in the array.
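
In outline, a cuckoo lookup of this kind might look like the following sketch (for the single-integer-key case; the partition layout and the hash bit fields are illustrative assumptions, not the Virtuoso source):

#include <stdint.h>

typedef struct
{
  int64_t *slots;        /* entries of one partition; 0 means empty */
  int64_t *exceptions;   /* overflow list for this partition */
  int      n_exceptions;
  int      size;         /* number of slots in the partition */
} cuckoo_part_t;

/* A modulo of the hash picks the partition, two further fields of the
   hash pick the two candidate places, the exception list comes last. */
static int
cuckoo_contains (cuckoo_part_t *parts, int n_parts, uint64_t hash, int64_t key)
{
  cuckoo_part_t *p = &parts[hash % n_parts];
  int place1 = (int) ((hash >> 16) % p->size);   /* bit fields are assumptions */
  int place2 = (int) ((hash >> 40) % p->size);
  int i;
  if (p->slots[place1] == key || p->slots[place2] == key)
    return 1;
  for (i = 0; i < p->n_exceptions; i++)
    if (p->exceptions[i] == key)
      return 1;
  return 0;
}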

In most cases in TPC-H, the hash joins are selective; i.e., most items on the probe side find no match in the hash table.

So, quite often you have 3 cache misses to show that there is no hit. This is, at least in theory, quite bad.

There are Bloom filters before the hash table. The Bloom filter will prune most of the probes that would miss. A Bloom filter is an array of bits. Given a hash number, the Bloom filter will very efficiently tell you whether the entry is sure not to be in the hash table. If the Bloom filter says it can be in the hash table, you must look.

In the Virtuoso case, for each entry in the hash table, the Bloom filter has 8 bits. The Bloom check uses a field of the hash number to pick a 64-bit word from the Bloom filter. Then different fields of the hash number are used to set up to 4 bits in a 64-bit bit-mask. When building the hash table, the masks are OR-ed into the Bloom filter. When probing, before looking in the hash table, the system checks whether the bits corresponding to the hash number are all on in the appropriate word. If they are not, the hash lookup is sure to miss.
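
A minimal sketch of this check follows. The macro names mirror the loop shown later in this article, but these particular definitions and bit fields are illustrative assumptions, not the actual Virtuoso ones.

#include <stdint.h>

/* Pick the 64-bit filter word; one word covers 8 hash table entries
   (8 bits per entry).  'sz' is the number of words.  Illustrative. */
#define BF_WORD(h, sz)  (((h) >> 32) % (sz))

/* Build a mask with up to 4 bits, each chosen by a different 6-bit
   field of the hash number.  The exact fields are an assumption. */
#define BF_MASK(h)  ((1ULL << ((h) & 63))         \
                   | (1ULL << (((h) >> 6) & 63))  \
                   | (1ULL << (((h) >> 12) & 63)) \
                   | (1ULL << (((h) >> 18) & 63)))

/* Build side: OR the mask into the word. */
static inline void
bloom_add (uint64_t *bf, uint64_t sz, uint64_t h)
{
  bf[BF_WORD (h, sz)] |= BF_MASK (h);
}

/* Probe side: a hit is possible only if all mask bits are on. */
static inline int
bloom_maybe (const uint64_t *bf, uint64_t sz, uint64_t h)
{
  uint64_t mask = BF_MASK (h);
  return (bf[BF_WORD (h, sz)] & mask) == mask;
}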

Most expositions of Bloom filters talk about setting two bits for every value. With two bits set, we found 8 bits per value to work best: more bits make a larger filter that misses the cache more; fewer bits make too many collisions, and the Bloom filter produces too many false positives. A significant finding is that with 8 bits per value, setting 4 bits instead of 2 makes the filter twice as selective. This simple trick cuts the number of hash lookups for items that passed the Bloom filter in half in many selective hash joins. Examples are the many joins of lineitem with part or supplier where there is a condition on the smaller table.
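
As a rough cross-check with the standard Bloom filter estimate (which ignores this scheme's restriction of all bits to a single word), the false-positive rate with 8 bits per value and k bits set is about (1 − e^(−k/8))^k: roughly 4.9% for k = 2 and roughly 2.4% for k = 4, i.e., about twice as selective, in line with what we measured.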

Still, even with Bloom filters, a cuckoo hash will make too many cache misses.

So, enter linear hash. The idea is simple: The hash number picks a place in an array. Either the entry being sought is in the vicinity, or it is not in the hash table. If the vicinity is full of other entries, the entry can still be in an exception list.

With this and cuckoo alike, there are 3 different variants of hash table:

  1. A set of single unique integers

  2. A single-integer key with 0 or more dependent values, possibly with a next link if the key is not unique

  3. A key of n arbitrary values, 0 or more dependent values, optional next link if the key is not unique

In the first case, the hash table is an array of values; in the two other cases, it is an array of pointers. But since a pointer is 64 bits, of which the high 16 are not in the address space of x86_64, these high bits can be used to keep a part of the hash number. It will be necessary to dereference the pointer only if the high bits match the hash number. This means that nearly all lookups that do not find a match are handled with a single cache miss.

Each cache miss brings in a cache line of 8 words. The lookup starts at a point given by the hash number and wraps around at the end of the cache line. Only in the case that all 8 words are occupied but do not match does one need to look at the exceptions. There is one exception list for each partition of the hash table, like in the cuckoo scheme.
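
The following sketch shows what such a probe might look like for the pointer-based variants (illustrative only, not the actual source; it assumes the slot count is a multiple of 8 and shows just the in-line probe, leaving out the exception list):

#include <stdint.h>
#include <stddef.h>

typedef struct entry_s { int64_t key; /* dependent values follow */ } entry_t;

#define PTR_BITS   0x0000FFFFFFFFFFFFULL
#define TAG_OF(w)  ((w) >> 48)                   /* high 16 bits hold hash bits */
#define PTR_OF(w)  ((entry_t *) ((w) & PTR_BITS))

/* Probe one 8-word group (one 64-byte cache line).  Only slots whose
   16-bit tag matches the hash are dereferenced; an empty slot means a
   definite miss in this simplified sketch. */
static entry_t *
linear_get (uint64_t *table, uint64_t n_slots, uint64_t h, int64_t key)
{
  uint64_t pos = h % n_slots;
  uint64_t base = pos & ~7ULL;                   /* start of the cache line */
  int i;
  for (i = 0; i < 8; i++)
    {
      uint64_t w = table[base + ((pos + i) & 7)];
      if (!w)
        return NULL;
      if (TAG_OF (w) == (h >> 48))               /* which hash bits: an assumption */
        {
          entry_t *e = PTR_OF (w);
          if (e->key == key)
            return e;
        }
    }
  return NULL;   /* all 8 slots occupied, no match: the exception list would be checked */
}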

A hash lookup is always done on a vector of keys; the loop that takes most of the time is in fact the Bloom filter check. It goes as follows:

/* Per-key working variables: the hash number, the Bloom filter word, and the bit mask. */
#define CHB_VAR(n)                           \
   uint64 h##n, w##n, mask##n;

/* Hash the key, fetch the 64-bit Bloom filter word it falls in, and build the mask. */
#define CHB_INIT(n, i)                       \
   MHASH_STEP_1 (h##n, i);                   \
   w##n = bf[BF_WORD (h##n, sz)];            \
   mask##n = BF_MASK (h##n);

/* Predicated check: always store the candidate index; advance the result
   counter only if all mask bits are set in the filter word. */
#define CHB_CK(n)                            \
   { matches[mfill] = inx + n;               \
     mfill += (w##n & mask##n) == mask##n; }

       for (inx = inx; inx < last; inx ++)
         {
           CHB_VAR (0);
           CHB_INIT (0, REF ((ce_first + sizeof (ELT_T) * inx)));
           CHB_CK (0);
         }

This is the perfect loop for out-of-order execution. Now, I have tried every variation you can imagine, and this does not get better. The loop calculates a hash number, fetches the corresponding word from the Bloom filter, calculates a mask, stores the index of the key in a results array, and increments the results counter if all the bits were set. There is no control dependency anywhere, just a data dependency between successive iterations; i.e., to know where the result must go, you must know if the previous was a hit.

You can unroll this loop very easily, so, for example, take 4 keys, do the numbers, fetch the words, and then check them one after the other. One would think this would have more misses in flight at any one time, which it does. But it does not run any faster.

Maybe the loop is too long. Circumstantial evidence suggests that short loops are better for instruction prefetching. So, one can also make a loop that gets any number of words of the Bloom filter and puts them in one local array and the hash numbers in another array. A subsequent loop then reads the hash number, calculates the mask, and checks if there is a hit. In this way one can generate as many misses as one wants and check them as late as one wants. It so happens that doing 8 misses and then checking them is better than either 4 or 16. But 8 is still marginally worse than the loop first mentioned.
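
Sketched below is this two-pass variant, reusing the variables and macros of the fragment above (so it is a sketch in the same style, not the actual source); the batch of 8 matches the batch size that worked best.

#define CHB_BATCH 8

       while (inx < last)
         {
           uint64 hs[CHB_BATCH], ws[CHB_BATCH];
           int n = last - inx < CHB_BATCH ? last - inx : CHB_BATCH;
           int i;
           /* First pass: hash the keys and fetch the Bloom words, letting the
              cache misses overlap. */
           for (i = 0; i < n; i++)
             {
               MHASH_STEP_1 (hs[i], REF ((ce_first + sizeof (ELT_T) * (inx + i))));
               ws[i] = bf[BF_WORD (hs[i], sz)];
             }
           /* Second pass: check the masks, predicated as before. */
           for (i = 0; i < n; i++)
             {
               uint64 mask = BF_MASK (hs[i]);
               matches[mfill] = inx + i;
               mfill += (ws[i] & mask) == mask;
             }
           inx += n;
         }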

One can also vary the test. Instead of adding a truth value to the result counter, one can have

if ((word & mask) == mask) result[fill++] = inx;

There is no clear difference between predication (incrementing the fill by truth value) and a conditional jump. The theory of out-of-order execution would predict predication to be better, but the difference is lost in measurement noise. This is true on both Intel Nehalem (Xeon 55xx) and Sandy Bridge (E5 26xx), but could be different on other architectures.

The multicore scalability of the test will give some information about platform utilization.

This is the ultimately simplified selective hash join:

SELECT  COUNT (*) 
  FROM  lineitem, 
        part 
 WHERE  l_partkey = p_partkey 
   AND     p_size < 15
;

The following is the second simplest hash join but misses the cache much more: since it now has a key and a dependent part in the hash table, there is an extra pointer to follow, and the hash entry is two words plus the pointer to them in the hash table array.

SELECT  SUM (p_retailprice) 
  FROM  lineitem, 
        part 
 WHERE  l_partkey = p_partkey 
   AND     p_size < 15
;

By adjusting the number of parts selected, we can vary the Bloom filter selectivity and the size of the hash table. Below, we show the times for the two queries with single-thread and 24-thread execution, with different percentages of the part table on the build side of the hash join. All runs are against warm 100G TPC-H on the same test system as in the rest of the TPC-H series (dual Xeon E5-2630).

This table compares the performance of the linear and cuckoo implementations on the above queries (COUNT vs. SUM) on either 24 threads or 1 thread. Four data points are given for different sizes of hash table, given as a percentage of the part table (400K to 20M entries) placed in the hash table. The rightmost column, which represents the case where the entire part table is on the build side, does not have a Bloom filter; the other cases do. The Bloom bits are 8/4 for linear and 8/2 for cuckoo. The times are all in milliseconds, and the thousands separator is a comma.

Hash type   Query type   Threads    2% (ms)   10% (ms)   30% (ms)   100% (ms)
Linear      COUNT        24           1,204      1,683      3,100       6,214
Linear      SUM          24           1,261      2,447      5,059      13,086
Linear      COUNT        1           15,286     22,451     38,863      66,722
Linear      SUM          1           17,575     33,664     81,927     179,013
Cuckoo      COUNT        24           1,849      2,840      4,105       6,203
Cuckoo      SUM          24           2,833      4,903      9,446      19,652
Cuckoo      COUNT        1           25,146     39,064     57,383      85,105
Cuckoo      SUM          1           33,647     67,089    121,989     240,941

We clearly see cache effects on the first two lines, where SUM and COUNT run in almost the same time on a small hash table but show a 2x difference on the larger hash table. The instruction path length is not very different for SUM and COUNT, but the memory footprint has a 3x difference: the COUNT hash table holds a single integer per entry, while the SUM table holds a pointer in the array plus the two-word entry it points to.

We note that the SMP scalability of linear is slightly better, comparing the ratio of 24-thread SUM to single-thread SUM; both ratios are over 12x, indicating net benefit from core multithreading. (The test system has 12 physical cores.) The linear hash systematically outperforms cuckoo, understandably, since it makes a smaller net number of cache misses. The overall effect on the TPC-H score is noticeable, at around 15-20K units of composite score at 100G.

In conclusion, the Virtuoso hash join implementation is certainly on the level, with only small gains to be expected from further vectoring and prefetching. These results may be reproduced using the v7fasttrack Virtuoso Open Source releases from GitHub; develop/7 for cuckoo and feature/analytics for linear hash.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

05/28/2014 17:12 GMT-0500
In Hoc Signo Vinces (part 14 of n): Virtuoso TPC-H Implementation Analysis

In the previous article we saw an unofficial result of running the full workload. Here we will look more closely at the performance profile.

In this article we look at what the server actually does. The execution profiles for all the queries are available for download. To experiment with parallelism, you may download the software and run it locally. An Amazon image may be provided later.

Execution Profile

Below is the top of the oprofile output for a run of the 22 queries with qualification parameters against the 100G database. The operation in TPC-H terms is given under each heading.

CPU: Intel Sandy Bridge microarchitecture, speed 2299.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1406009   9.5117  ce_vec_int_range_hash
 
-- Bloom filter before selective hash join where selective hash join is best or only condition on a scan, e.g., lineitem scan where l_partkey checked against a Bloom filter
730935    4.9448  ce_vec_int_sets_hash
 
-- Bloom filter check where another condition is applied first, e.g., lineitem scan with condition on l_shipdate, then Bloom filter check on l_partkey
617091    4.1746  hash_source_chash_input_1i_n
 
-- Q13 right outer hash join with probe from orders, build from customer, NOT EXISTS test in Q16
586938    3.9706  cs_decode
 
-- Generic reading of a column, all queries
536273    3.6279  ce_intd_range_ltgt
 
-- Date range comparison, most queries, e.g., Q1, 3, 4, 5, 6, 7, 8, 20
479898    3.2465  cha_cmp_1i
 
-- Q13 GROUP BY on c_custkey, Q15 GROUP BY on S_suppkey. Indicates missing the cache on high cardinality GROUP BY single INT grouping key
473721    3.2047  cha_inline_1i_n_int
 
-- Selective hash join after prefiltering with Bloom filter. Check only that key in hash table, no dependent part. For example Q8, Q9, Q17, Q20, with lineitem filtered by part
463723    3.1371  ce_dict_generic_range_filter
 
-- Range condition on low cardinality column (dictionary encoded), e.g., l_quantity, l_discount
425149    2.8761  cha_inline_1i_int
 
-- Hash join check that a single INT key with a dependent part is in a hash table, fetch the dependent part for hits
365040    2.4695  setp_chash_run
 
-- GROUP BY, e.g., GROUP BY on c_custkey in Q13
359645    2.4330  clrg_partition_dc
 
-- Partitioning a vector of values, occurs in all high cardinality GROUP BYs, e.g., Q13, Q15
349473    2.3642  gb_aggregate
 
-- Updating aggregates after the grouping keys are resolved, e.g., Q1
331926    2.2455  ce_dict_any_sets_decode
 
-- Fetching non-contiguous dictionary encoded strings, e.g., l_returnflag, l_linestatus in Q1
316731    2.1427  cha_insert_1i
 
-- Building a hash join hash table from a 1 INT key to a dependent part, e.g., Q14 from l_partkey to lineitems in a time window with a given l_partkey
286865    1.9406  ce_search_rld
 
-- Index lookup for run length delta compressed keys, e.g., l_orderkey, ps_partkey
231390    1.5654  cha_insert
 
-- Build of a hash join hash table for multipart keys, e.g., hash join with partsupp (Q9) or lineitem (Q17, Q20) on build side
224070    1.5158  ce_dict_int64_sets_decode
 
-- Fetching a non-contiguous set of double column values from dictionary encoding, e.g., l_discount, l_quantity
218506    1.4782  ce_intd_any_sets_decode
 
-- Fetching a non-contiguous set of date column values, e.g., l_shipdate in Q9
200686    1.3576  page_wait_access
 
-- Translating page numbers into buffers in buffer pool for column access
198854    1.3452  itc_col_seg
 
-- Generic part of table scan/index access
197645    1.3371  cha_insert_1i_n
 
-- Hash join build for hash tables with a 1 INT key and no dependent, e.g., lineitem to part join in Q9, Q17, Q20
195300    1.3212  cha_bloom_unroll_a
 
-- Bloom filter for hash based in predicate
192309    1.3010  hash_source_chash_input
 
-- Hash join probe with multi-part key, e.g., Q9 against partsupp
191313    1.2942  strstr_sse42
 
-- SSE 4.2 substring match, e.g. NOT LIKE condition in Q13
186325    1.2605  itc_fetch_col_vec
 
-- Translating page numbers into buffers in buffer pool for column access
159198    1.0770  ce_vec_int64_sets_decode
 
-- Fetching non-contiguous 64-bit values from array represented column, e.g., l_extendedprice

So what does TPC-H do? It is all selective hash joins. Then it is table scans with a condition on a DATE column. The DATE column part is because this implementation does not order the big tables (lineitem, orders) on their DATE columns. Third, TPC-H does big GROUP BYs, with Q13 representing most of this; all other GROUP BYs have at least 10x fewer groups. Then it just extracts column values, most often after selecting the rows on some condition; quite often the condition is a foreign key column of the row finding a hit in a hash table. This last is called invisible hash join.

Then there are some index lookups, but there the join is usually a merge pattern, like reading orders in order of o_orderkey and then getting l_orderkey matches from lineitem (e.g., Q21). Then there are quite often partitioning operations, i.e., many threads produce tuples and these tuples go to a consumer thread selected based on a partitioning column on the tuple. This is also called exchange. This is in all the high cardinality GROUP BYs and in the RIGHT OUTER JOIN of Q13.

Whether the implementation is RAM-only or paging from disk makes next to no difference. Virtuoso and Actian Vector both have a buffer pool, with Virtuoso using substantially smaller pages (8K vs. 256K or 512K). The page-number-to-buffer translation and the latching that goes with it is under 3% (itc_fetch_col_vec, page_wait_access). Of course, actually accessing secondary storage will kill the score, but checking that something is in memory is safe as long as this is always a hit.

So, once the query plans are right, the problem resolves into a few different loops. The bulk of the effort in making a TPC-H implementation is in query optimization so that the right loops run in the right order. I will further explain what the loops should contain in the next article.

Many of the functions seen in the profile are instantiations of a template for a specific data type. There are also different templates for variants of a data structure, like a hash table with different keys.

Compilation is a way of generating exactly the right loop for any set of data types. We do not find much need for this here, though. The type-specific operations that are in fact needed are anticipatable and can be predefined based on templates.

Future Gains

There are possible gains in the following domains:

  • Query optimization - Q15, Q17 do some work in duplicate, so reuse will roughly cut the time in half. In addition to reuse, there is a possibility to convert a scalar subquery into a derived table with a GROUP BY (decorrelation). Decorrelation also applies in Q20. Reuse will give some 10-12K more score; decorrelation is probably below measurement noise.

  • Group join - Merging the GROUP BY and the hash probe in Q13 will save 2% of time on throughput, maybe 5K more score.

  • Better parallelism in power test - A good power test takes 42s, and 5 times the work takes 165s, so the platform utilization of the power test is roughly 4/5. This could be slightly better.

  • NUMA - Up to 20% performance gains have been seen on scale-out configurations from binding a process to one CPU socket. This gain will be readily seen in a scale-out setting when running one process per physical CPU. Gains in the NUMA department are rather fickle and fragile and are worth less in real life than in benchmarks. So I would say that work in single-server NUMA optimization is not immediately needed since scale-out configurations will get these gains as soon as one sets affinities for processes. But whether binding processes to CPUs makes sense depends on how even the workload is. TPC-H is even, but reality is less so.

In principle, a composite score of a good run could go from 250K to 280K by diverse incremental improvements. There is some noise in run times, so two consecutive runs can deviate by a few percent, which is already significant when talking of small improvements.

Larger Scales

For a 100 GB run with 5 throughput streams, the peak query memory consumption is 3.5 GB. This includes all the GROUP BY and hash join tables that are made. Q13 is the most expensive in terms of space and allocates a peak of 1 GB for a single execution. This is trivial in comparison with the working set. But at 100x greater scale (10,000 GB, or 10 TB) this becomes about 630 GB (3.5 GB x 100 x 9/5), as there will be a minimum of 9 concurrent streams instead of 5.

Now 10 TB is clearly a scale-out size, so the 630 GB transient memory is not on a single machine.

Still, going to large scale-out will change the nature of some queries and introduce significant data transfer. We will know soon enough.

Problems of the Metric

A small-scale TPC-H, e.g., 100 GB, starts to have features of a lookup workload. This means that there is high variability between consecutive executions, and that the pre-execution state of the system has a large effect on a measurement.

The rules say that the power test follows the bulk load, with maybe some checking of the correctness of the load in between. The bulk load is basically unregulated and usually will include statistics gathering.

The first power test shows significant variation, anything from 220K to 240K, while the second power test is steadily around 255K. Since the reported score is the lower of the two, the biggest return to the implementor is in making sure the first power test is good.

The second throughput test is usually 5-10K higher than the first; the throughput test is less sensitive. The difference does not come from I/O, but from system time for memory allocation. The memory, in the same quantities and block sizes, is reused by the second run.

A good power run is 42s from 100% warm RAM. A power run that gets the data from the OS disk buffers is 70s or so. A power run that gets data from SSD is worse, maybe 120s.

To cite an example, increasing the buffer pool size from 64 GB to 72 GB gets the first post-load power test from 120-150K to 230-240K while having no appreciable effect on subsequent tests. The effect is exacerbated by the fact that the power score is based on a geometric mean of run times. Very short queries (e.g., Q2) vary between consecutive in-memory executions from 120ms to 220ms. A similar variation occurs in Q13, which lands on either side of 6s. Due to the geometric mean, the same variability has a very different impact depending on which query it hits. A Q2 that reads data from out of process can take 2s instead of the expected under-200ms. This kills the score even if a delay of 1.8s as such would not. So increasing the buffer pool in the example just serves to make sure the small supplier table is in memory. Fetching it from the OS is simply not an option in the first Q2, even if it were an option in a longer-running query. Remember, the lower of the two scores is reported, and the first power test will be bad unless it is somehow primed by some trick like bulk load order.
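
To put rough numbers on this: the power metric is based on the geometric mean of the 22 query times and the 2 refresh times, i.e., a 24th root of their product. A Q2 that slips from 0.2s to 2s multiplies that mean by about 10^(1/24) ≈ 1.10, costing roughly 10% of the power score, whereas a Q13 drifting from 6s to 7.8s (30% slower) multiplies it by only about 1.3^(1/24) ≈ 1.01, i.e., about 1%.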

When differences between implementations are small, variation between consecutive runs becomes important. This is why OLTP benchmarks are required to run for a relatively long time and only measure the steady state portion. This would also be appropriate for small TPC-H runs.

Conclusions

Query performance reduces to a few loops when all the conditions are right. Getting the conditions to be right depends on query optimization and the right architecture choices. These results would be unthinkable without vectored execution and a compressed column store design. These in and of themselves guarantee nothing unless all plans are right. An earlier Virtuoso 7 takes over 10x longer because of bad plans and because it misses some key execution techniques, like the RIGHT OUTER JOIN in Q13 and the partitioned GROUP BY.

Last summer, we did runs of the much simpler Star Schema Benchmark. The results were very good, because the engine had the much smaller set of tricks and the right plans. Repeating these tests now would show some gains from a still better hash join but nothing dramatic.

In the next article we will look at the finer points of hash join. After this we move to larger scales and clusters.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

05/01/2014 13:07 GMT-0500 Modified: 05/28/2014 17:13 GMT-0500
In Hoc Signo Vinces (part 13 of n): Virtuoso TPC-H Kit Now on V7 Fast Track

This article shows how you can reproduce the TPC-H experiments from the previous posts in this series.

All the code is in the feature/analytics branch of the v7fasttrack git repository on GitHub.

Prerequisites

Start by checking out and compiling Virtuoso Open Source (VOS).

git clone https://github.com/v7fasttrack/virtuoso-opensource
cd virtuoso-opensource 
git checkout feature/analytics
./autogen.sh
export CFLAGS="-msse4.2 -DSSE42"
./configure 
make -j 24
make install

The system should be an x86_64, Intel Core i7 or later, with SSE 4.2 support. (Running without SSE 4.2 is possible, but for a better score you need to define it, as in the CFLAGS above, before running configure.) The gcc may be any version that supports SSE 4.2.

To have a good result, the system should have at least 96 GB of RAM and SSD.

To get a good load time, both the database files and the CSV files made by dbgen should be on SSD.

Running TPC-H

Set up

Copy the binsrc/tests/tpc-h directory from the check-out to an SSD, if it is not already on one. Set $HOME to point to the root directory of the check-out. Rename the virtuoso.ini-100G in the tpc-h directory to virtuoso.ini. Edit the database file paths in the virtuoso.ini. Where it says --

Segment1 = 1024, /1s1/dbs/tpch100cp-1.db = q1, /1s2/dbs/tpch100cp-2.db = q2

-- change the file names, and add or remove files as appropriate, each file with a different = qn, until you have one file per independent device. If this is a RAID, one file per distinct device in the RAID usually brings improvement. Edit the TransactionFile entry, and replace /1s2/dbs/ with a suitable path.

Edit ThreadsPerQuery to be the number of threads on the machine. For i7, this is double the number of cores; your environment may vary. AsyncQueueMaxThreads should be set to double ThreadsPerQuery.

The memory settings (NumberOfBuffers and MaxDirtyBuffers) are OK for 100G scale. For larger scale, make the memory settings correspondingly larger, not to exceed 75% of system memory. Count 8.5 KB per buffer. If you have less memory, you can decrease these. If so, the first power test will be hit the worst, so the scores will not be as good.
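
As a rough worked example: on the 192 GB machines used in this series, 75% of memory is 144 GB, which at 8.5 KB per buffer comes to roughly 17 million buffers as the upper bound for NumberOfBuffers at larger scales.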

The default BIOS settings are usually OK. Disabling prefetch of adjacent cache line does not help, and turning off core threads does not help either.

For 100G, the data files for loading will take 100 GB, and the database files will take 88 GB divided among however many files. Be sure there is enough space before you start.

Generate the data

In the tpc-h directory (copy of binsrc/tests/tpc-h) --

./gen.sh 100 5 2

The first parameter is the scale factor; the second is the number of streams to use in the throughput test; the last is the number of consecutive test runs. The minimum number of streams is 5 for 100G; each successive scale adds one more. A larger number of streams is allowed, but will not make a better result in this case. A test always consists of 2 runs; you could specify more, but the extra tests will not influence the score.

Making the data files takes the longest time. You may run dbgen multithreaded to make the dataset in parallel, but then the load scripts will have to be changed to match.

Run the load

Start Virtuoso.

./load.sh 100

Looking at iostat, you will see a read rate of about 140 MB/s from the source files.

Run the test

./run.sh 100 5 2 

The parameters have the same meaning as in gen.sh, and the same values must be specified.

Outputs

The run produces two main files, report1.txt and report2.txt. These are the numerical quantity summaries for the first and second run. Additionally, there are output files for each query stream. The suppfiles.sh script can be used to collect the supporting files archive for a TPC-H full disclosure report.

The database is left running. To reuse the loaded data for another experiment, kill the Virtuoso process, delete the transaction log, and restart. This will have the data in the post-load state. To get warm cache, use the warm.sql script in the tpc-h directory.

On the test system used in this series, 12 core E5 at 2.3GHz, we expect 240K for the first run and 250K for the second. With a top-of-the-line E5, we expect around 400K. For an 8-core 2.26GHz Nehalem, we expect 150K.

If you get scores that are significantly different, something is broken; we would like to know about this.

If you have this audited according to the TPC rules, you will be allowed to call this a TPC-H result. Without such audit, the result should be labeled Virt-H.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/28/2014 12:10 GMT-0500 Modified: 05/28/2014 17:13 GMT-0500
In Hoc Signo Vinces (part 12 of n): TPC-H: Result Preview

In this article, we look at the 100 GB single-server results for the whole workload. We will call this Virt-H instead of TPC-H in order to comply with the TPC rules: Use of the TPC-H label requires an audit.

The test consists of a bulk load followed by two runs. Each run consists of a single user power test and a multi-user throughput test. The number of users in the throughput test is up to the test sponsor but must be at least 5 for the 100 GB scale. The reported score is the lower of the two scores.

Result Summary

Scale Factor 100 GB
dbgen version 2.15
Load time 0:15:02
Composite qph 241,482.3
System Availability Date 2014-04-22

The price/performance is left open. The hardware costs about 5000 euros and the software is open source so the cost per performance would be a minimum of 0.02 euros per qph at 100G. This is not compliant with the TPC pricing rules though. These require 3 year maintenance contracts for all parts.

The software configuration did not use RAID. Otherwise the software would be auditable to the best of my knowledge. The hardware would have to be the same from Dell, HP, or other large brand to satisfy the TPC pricing rule.

Executive Summaries of Each Run

Run 1

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 239,785.1
Virt-H Throughput 243,191.4
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 241,482.3
Measurement Interval in Throughput Test (Ts) 162.935000 seconds

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:17:46 2014-04-21 21:18:33 0:00:47
Stream 1 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 2 2014-04-21 21:18:33 2014-04-21 21:21:13 0:02:40
Stream 3 2014-04-21 21:18:33 2014-04-21 21:21:06 0:02:33
Stream 4 2014-04-21 21:18:33 2014-04-21 21:21:10 0:02:37
Stream 5 2014-04-21 21:18:33 2014-04-21 21:21:16 0:02:43
Refresh 0 2014-04-21 21:17:46 2014-04-21 21:17:49 0:00:03
  2014-04-21 21:17:50 2014-04-21 21:17:51 0:00:01
Refresh 1 2014-04-21 21:19:25 2014-04-21 21:19:38 0:00:13
Refresh 2 2014-04-21 21:18:33 2014-04-21 21:18:48 0:00:15
Refresh 3 2014-04-21 21:18:49 2014-04-21 21:19:01 0:00:12
Refresh 4 2014-04-21 21:19:01 2014-04-21 21:19:13 0:00:12
Refresh 5 2014-04-21 21:19:13 2014-04-21 21:19:25 0:00:12

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.311882 0.383459 1.143286 0.439926 1.594027 0.736482 1.440826 1.198925
Stream 1 5.192341 0.952574 6.184940 1.194804 6.998207 5.122059 5.962717 6.773401
Stream 2 7.354001 1.191604 4.238262 1.770639 5.782669 1.357578 4.034697 6.354747
Stream 3 6.489788 1.585291 4.645022 3.358926 7.904636 3.220767 5.694622 7.431067
Stream 4 5.609555 1.066582 6.740518 2.503038 9.439980 3.424101 4.404849 4.256317
Stream 5 10.346825 1.787459 4.391000 3.151059 4.974037 2.932079 6.191782 3.619255
Min Qi 5.192341 0.952574 4.238262 1.194804 4.974037 1.357578 4.034697 3.619255
Max Qi 10.346825 1.787459 6.740518 3.358926 9.439980 5.122059 6.191782 7.431067
Avg Qi 6.998502 1.316702 5.239948 2.395693 7.019906 3.211317 5.257733 5.686957
 
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.476940 2.004782 2.070967 1.015134 7.995799 2.142581 1.989357 1.581758
Stream 1 11.351299 6.657059 7.719765 5.157236 25.156379 8.566067 7.028898 8.146883
Stream 2 13.954105 8.341359 10.265949 3.289724 25.249435 6.370577 11.262650 7.684574
Stream 3 13.597277 5.783821 5.944240 5.214661 24.253991 8.742896 7.701709 5.801641
Stream 4 15.612070 6.126494 4.533748 5.733828 23.021583 6.423207 8.358223 6.866477
Stream 5 8.421209 9.040726 7.799425 3.908758 23.342975 9.934672 11.455598 8.258504
Min Qi 8.421209 5.783821 4.533748 3.289724 23.021583 6.370577 7.028898 5.801641
Max Qi 15.612070 9.040726 10.265949 5.733828 25.249435 9.934672 11.455598 8.258504
Avg Qi 12.587192 7.189892 7.252625 4.660841 24.204873 8.007484 9.161416 7.351616
 
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.258070 0.981896 1.161602 1.933124 2.203497 1.042949 3.349407 1.296630
Stream 1 8.213340 4.070175 5.662723 12.260503 7.792825 3.323136 9.296430 3.939927
Stream 2 16.754827 3.895688 4.413773 7.529466 6.288539 2.717479 11.222082 4.135510
Stream 3 8.486809 2.615640 7.426936 7.274289 6.706145 3.402654 8.278881 4.260483
Stream 4 12.604905 7.735042 5.627039 6.343302 7.242370 3.492640 6.503095 3.698821
Stream 5 8.221733 2.670036 5.866626 13.108081 9.428098 4.282014 8.213320 4.088321
Min Qi 8.213340 2.615640 4.413773 6.343302 6.288539 2.717479 6.503095 3.698821
Max Qi 16.754827 7.735042 7.426936 13.108081 9.428098 4.282014 11.222082 4.260483
Avg Qi 10.856323 4.197316 5.799419 9.303128 7.491595 3.443585 8.702762 4.024612

Run 2

Report Date 2014-04-21
Database Scale Factor 100
Total Data Storage/Database Size 1 TB / 87,496 MB
Start of Database Load 2014-04-21 21:02:43
End of Database Load 2014-04-21 21:17:45
Database Load Time 0:15:02
Query Streams for Throughput Test 5
Virt-H Power 257,944.7
Virt-H Throughput 240,998.0
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 249,327.4
Measurement Interval in Throughput Test (Ts) 164.417000 seconds

Duration of stream execution

  Start Date/Time End Date/Time Duration
Stream 0 2014-04-21 21:21:20 2014-04-21 21:22:01 0:00:41
Stream 1 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 2 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 3 2014-04-21 21:22:02 2014-04-21 21:24:41 0:02:39
Stream 4 2014-04-21 21:22:02 2014-04-21 21:24:44 0:02:42
Stream 5 2014-04-21 21:22:02 2014-04-21 21:24:46 0:02:44
Refresh 0 2014-04-21 21:21:20 2014-04-21 21:21:22 0:00:02
  2014-04-21 21:21:22 2014-04-21 21:21:23 0:00:01
Refresh 1 2014-04-21 21:22:49 2014-04-21 21:23:04 0:00:15
Refresh 2 2014-04-21 21:22:01 2014-04-21 21:22:14 0:00:13
Refresh 3 2014-04-21 21:22:14 2014-04-21 21:22:27 0:00:13
Refresh 4 2014-04-21 21:22:26 2014-04-21 21:22:39 0:00:13
Refresh 5 2014-04-21 21:22:39 2014-04-21 21:22:49 0:00:10

Numerical Quantities Summary -- Timing Intervals in Seconds

  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.437262 0.227516 1.172620 0.541201 1.542084 0.743255 1.459368 1.183166
Stream 1 5.205225 0.499833 4.854558 4.818087 5.920773 3.347414 5.446411 3.723247
Stream 2 5.833803 0.659051 6.023266 3.123523 4.358200 3.371315 6.772453 4.978415
Stream 3 6.308935 0.662744 7.573807 5.000859 5.282467 4.391930 5.280472 7.852718
Stream 4 5.791856 0.421592 5.953592 4.688037 9.949038 3.098282 4.153124 4.824209
Stream 5 13.537098 1.760386 3.308982 2.299178 4.882695 2.652497 5.383128 10.178447
Min Qi 5.205225 0.421592 3.308982 2.299178 4.358200 2.652497 4.153124 3.723247
Max Qi 13.537098 1.760386 7.573807 5.000859 9.949038 4.391930 6.772453 10.178447
Avg Qi 7.335383 0.800721 5.542841 3.985937 6.078635 3.372288 5.407118 6.311407
 
  Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 4.441940 1.948770 2.154384 1.148494 6.014453 1.647725 1.437587 1.585284
Stream 1 14.127674 7.824844 7.100679 3.586457 28.216115 7.587547 9.859152 5.829869
Stream 2 16.102880 7.676986 5.887327 2.796729 24.847035 7.146757 11.408922 7.641239
Stream 3 15.678701 5.786427 9.221883 2.692321 28.434916 6.657457 8.219745 7.706585
Stream 4 11.985421 10.182807 5.667618 6.875264 27.547492 7.438075 9.065924 8.895070
Stream 5 6.913707 7.662703 8.657333 3.282895 24.126612 10.963691 12.138564 7.962654
Min Qi 6.913707 5.786427 5.667618 2.692321 24.126612 6.657457 8.219745 5.829869
Max Qi 16.102880 10.182807 9.221883 6.875264 28.434916 10.963691 12.138564 8.895070
Avg Qi 12.961677 7.826753 7.306968 3.846733 26.634434 7.958705 10.138461 7.607083
 
  Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 2.275267 1.139390 1.165591 2.073658 2.261869 0.703055 2.327755 1.146501
Stream 1 13.720792 4.428528 3.651645 9.841610 6.710473 2.595879 9.783844 3.800103
Stream 2 12.532257 2.312755 6.182661 8.666967 9.383983 1.414853 7.570509 4.539598
Stream 3 7.578779 3.342352 8.155356 4.925493 6.590047 2.612912 8.497542 4.638512
Stream 4 10.967178 2.173935 6.382803 5.082562 8.744671 3.074768 7.577794 4.435140
Stream 5 9.438581 2.551124 8.375607 8.339441 8.201650 1.982935 7.334306 3.404017
Min Qi 7.578779 2.173935 3.651645 4.925493 6.590047 1.414853 7.334306 3.404017
Max Qi 13.720792 4.428528 8.375607 9.841610 9.383983 3.074768 9.783844 4.638512
Avg Qi 10.847517 2.961739 6.549614 7.371215 7.926165 2.336269 8.152799 4.163474

Details of System Under Test (SUT)

Hardware

Chassis Supermicro 2U
Motherboard Supermicro X9DR3-LN4F+
CPU 2 x Intel Xeon E5-2630 @ 2.3 GHz
(6 cores, 12 threads each;
total 12 cores, 24 threads)
RAM 192 GB DDR3 (24 x 8 GB, 1066MHz)
Storage 2 x Crucial 512 GB SSD
 

Software

DBMS Virtuoso Open Source 7.11.3209
(feature/analytics on v7fasttrack on GitHub)
OS CentOS 6.2

Conclusions

This experiment places Virtuoso in the ballpark with Actian Vector (formerly branded Vectorwise), which has dominated the TPC-H score board in recent years. The published Vector results are on more cores and/or faster clock; one would have to run on the exact same platform to make precise comparisons.

Virtuoso ups the ante by providing this level of performance in open source. For a comparison with EXASolution and Actian Matrix (formerly ParAccel), we will have to go to the Virtuoso scale-out configuration, to follow shortly.

The next articles will provide a detailed analysis of performance and instructions for reproducing the results. The run outputs and scripts are available for download.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/22/2014 11:33 GMT-0500 Modified: 05/28/2014 17:13 GMT-0500
In Hoc Signo Vinces (part 11 of n): TPC-H Q2, Q10 - Late Projection

Analytics is generally about making something small out of something large. This reduction is obtained by a TOP k operator (i.e., show only the 10 best by some metric) and/or by grouping and aggregation (i.e., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).

In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned about attributes on which there are no conditions, except for fetching them at the last moment, only for the entities which in fact will be returned to the user.

We look at TPC-H Q2 and Q10.

Q2:

  SELECT  TOP 100
                   s_acctbal,
                   s_name,
                   n_name,
                   p_partkey,
                   p_mfgr,
                   s_address,
                   s_phone,
                   s_comment
    FROM  part,
          supplier,
          partsupp,
          nation,
          region
   WHERE  p_partkey = ps_partkey
     AND  s_suppkey = ps_suppkey
     AND  p_size = 15
     AND  p_type LIKE '%BRASS'
     AND  s_nationkey = n_nationkey
     AND  n_regionkey = r_regionkey
     AND  r_name = 'EUROPE'
     AND  ps_supplycost = 
            ( SELECT  MIN(ps_supplycost)
                FROM  partsupp,
                      supplier,
                      nation,
                      region
               WHERE  p_partkey = ps_partkey
                 AND  s_suppkey = ps_suppkey
                 AND  s_nationkey = n_nationkey
                 AND  n_regionkey = r_regionkey
                 AND  r_name = 'EUROPE'
            )
ORDER BY  s_acctbal DESC,
          n_name,
          s_name,
          p_partkey

The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.

Q10:

  SELECT  TOP 20
                                                     c_custkey,
                                                     c_name,
          SUM(l_extendedprice * (1 - l_discount)) AS revenue,
                                                     c_acctbal, 
                                                     n_name,
                                                     c_address,
                                                     c_phone,
                                                     c_comment
    FROM  customer,
          orders,
          lineitem,
          nation
   WHERE  c_custkey = o_custkey
     AND  l_orderkey = o_orderkey
     AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
     AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
     AND  l_returnflag = 'R'
     AND  c_nationkey = n_nationkey
GROUP BY  c_custkey,
          c_name,
          c_acctbal,
          c_phone,
          n_name,
          c_address,
          c_comment
ORDER BY  revenue DESC

The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.

We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, and in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article does apply to Q2 but not to Q10.

The profile for Q2 follows:

{ 
time   6.1e-05% fanout         1 input         1 rows
time       1.1% fanout         1 input         1 rows
{ hash filler
Subquery 27 
{ 
time    0.0012% fanout         1 input         1 rows
REGION         1 rows(t10.R_REGIONKEY)
 R_NAME = <c EUROPE>
time   0.00045% fanout         5 input         1 rows
NATION         5 rows(t9.N_NATIONKEY)
 N_REGIONKEY = t10.R_REGIONKEY
time       1.6% fanout     40107 input         5 rows
SUPPLIER   4.2e+04 rows(t8.S_SUPPKEY)
 S_NATIONKEY = t9.N_NATIONKEY
 
After code:
      0: t8.S_SUPPKEY :=  := artm t8.S_SUPPKEY
      4: BReturn 0
time       0.1% fanout         0 input    200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time    0.0004% fanout         1 input         1 rows
{ fork
time        21% fanout     79591 input         1 rows
PART     8e+04 rows(.P_PARTKEY)
 P_TYPE LIKE <c %BRASS> LIKE <c > ,  P_SIZE =  15 
time        44% fanout  0.591889 input     79591 rows
 
Precode:
      0: { 
time     0.083% fanout         1 input     79591 rows
time      0.13% fanout         1 input     79591 rows
{ fork
time        24% fanout  0.801912 input     79591 rows
PARTSUPP       3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
 inlined  PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card       0.2 -> ()
time       1.3% fanout         0 input     63825 rows
Hash source 49 merged into ts not partitionable       0.2 rows(.PS_SUPPKEY) -> ()
 
After code:
      0:  min min.PS_SUPPLYCOSTset no set_ctr
      5: BReturn 0
}
 
After code:
      0: aggregate :=  := artm min
      4: BReturn 0
time      0.19% fanout         0 input     79591 rows
Subquery Select(aggregate)
}
 
      8: BReturn 0
PARTSUPP     5e-08 rows(.PS_SUPPKEY)
 inlined  PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time       5.9% fanout  0.247023 input     47109 rows
SUPPLIER unq       0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
 inlined  S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time     0.077% fanout         1 input     11637 rows
NATION unq         1 rows (.N_REGIONKEY, .N_NAME)
 inlined  N_NATIONKEY = .S_NATIONKEY
time     0.051% fanout         1 input     11637 rows
REGION unq       0.2 rows ()
 inlined  R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time      0.42% fanout         0 input     11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
 
}
time    0.0016% fanout       100 input         1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time      0.02% fanout         1 input       100 rows
PART unq      0.95 rows (.P_MFGR)
 inlined  P_PARTKEY = .P_PARTKEY
time     0.054% fanout         1 input       100 rows
SUPPLIER unq         1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
 inlined  S_SUPPKEY = k_.S_SUPPKEY
time   6.7e-05% fanout         0 input       100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}


 128 msec 1007% cpu,    196992 rnd 2.53367e+07 seq   50.4135% same seg   45.3574% same pg 

The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have been previously put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum price partsupp has already been seen. The gain would not be very large, though.

Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to make a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain. It would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal, hence as soon as there are 100 suppliers found, one can add a restriction on the s_acctbal for subsequent ones.
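
To make the TOP k effect concrete: once 100 suppliers are in the sort, the supplier lookup behaves roughly as sketched below. This is a conceptual sketch only; the parameter names are hypothetical, and the actual bound is maintained inside the sort operator rather than appearing in any query text.

-- :ps_suppkey stands for the current partsupp's supplier,
-- :acctbal_100th for the 100th-best s_acctbal accumulated so far
SELECT  s_acctbal,
        s_nationkey,
        s_name
  FROM  supplier
 WHERE  s_suppkey = :ps_suppkey
   AND  s_acctbal > :acctbal_100th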

At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup for getting only the columns that are returned. We note that this is done on 100 rows whereas the previous operations are done on tens-of-thousands of rows. The TOP k restriction produces some benefit, but it is relatively late in the plan, and not many operations follow it.

The plan is easily good enough, with only a little room for improvement. Q2 is one of the fastest queries of the set.

Let us now consider the execution of Q10:

{ 
time   1.1e-06% fanout         1 input         1 rows
time   4.4e-05% fanout         1 input         1 rows
{ hash filler
time   1.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, .N_NAME)
 
time   6.7e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
 
}
time   1.5e-06% fanout         1 input         1 rows
{ fork
time   2.4e-06% fanout         1 input         1 rows
{ fork
time        13% fanout 5.73038e+06 input         1 rows
ORDERS   5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
 O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time       4.8% fanout   2.00042 input 5.73038e+06 rows
LINEITEM       1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
 inlined  L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time        25% fanout         1 input 1.14632e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
CUSTOMER unq         1 rows (.C_NATIONKEY, .C_CUSTKEY)
 inlined  C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card         1 -> (.N_NAME)
time    0.0023% fanout         1 input 1.14632e+07 rows
Hash source 35 merged into ts          1 rows(.C_NATIONKEY) -> (.N_NAME)
time       2.3% fanout         1 input 1.14632e+07 rows
Stage 2
time       3.6% fanout         0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
 
}
time       0.6% fanout 3.88422e+06 input         1 rows
group by read node  
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time      0.57% fanout         0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
 
}
time   6.9e-06% fanout        20 input         1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time   0.00036% fanout         1 input        20 rows
CUSTOMER unq         1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
 inlined  C_CUSTKEY = .C_CUSTKEY
time   1.1e-06% fanout         0 input        20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}


 2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq   76.3221% same seg   21.1204% same pg 

The plan is by index, except for the lookup of nation name for the customer. The most selective condition is on order date, followed by the returnflag on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit. See the input cardinality above the first customer entry in the plan -- over 10M. The key point here is that only the c_custkey and c_nationkey get fetched, which saves a lot of time. In fact the c_custkey is needless since this is anyway equal to the o_custkey, but this makes little difference.

One could argue that customer should be between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between if the former are partitioned on the same key.

The c_nationkey is next translated into an n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many different c_custkey values (15M for 100G scale).

The most important trick is fetching all the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is only executed on 20 rows.

Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.

There is another approach to the challenge of this query: If foreign keys are declared and enforced, the system will know that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys, thus the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but does save the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).
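
To make the late projection concrete, Q10 behaves roughly as if it were rewritten as below, with the wide customer columns joined back only for the 20 surviving keys. This is an illustrative rewrite, not necessarily a form the engine produces or that the benchmark rules would accept as the query text; the alias top20 is introduced here, and the rewrite relies on c_name, c_acctbal, etc. being functionally dependent on c_custkey.

SELECT  top20.c_custkey,
        c.c_name,
        top20.revenue,
        c.c_acctbal,
        top20.n_name,
        c.c_address,
        c.c_phone,
        c.c_comment
  FROM  ( SELECT  TOP 20
                  c_custkey,
                  n_name,
                  SUM(l_extendedprice * (1 - l_discount)) AS revenue
            FROM  customer,
                  orders,
                  lineitem,
                  nation
           WHERE  c_custkey = o_custkey
             AND  l_orderkey = o_orderkey
             AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
             AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
             AND  l_returnflag = 'R'
             AND  c_nationkey = n_nationkey
        GROUP BY  c_custkey,
                  n_name
        ORDER BY  revenue DESC
        ) top20,
        customer c
 WHERE  c.c_custkey = top20.c_custkey
ORDER BY  top20.revenue DESC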

The late projection trick is generally useful, since almost all applications aside from bulk data export have some sort of limit on result set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the others. A row store can also benefit, in the form of decreased intermediate result size. This is especially good when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be implemented via a special text index rather than a scan.

*           *           *           *           *

In the next installment we will have a look at the overall 100G single server performance. After this we will recap the tricks so far. Then it will be time to look at implications of scale out for performance and run at larger scales. After the relational ground has been covered, we can look at implications of schema-lastness, i.e., triples for this type of workload.

So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/07/2014 12:30 GMT-0500 Modified: 05/28/2014 17:13 GMT-0500
OpenPHACTS in Vienna

Hugh Williams and I (Orri Erling) went to the Open PHACTS Steering Committee meeting in Vienna last week. I am a great fan of Open PHACTS; the meetings are fun, with a great team spirit, and there is always something new to learn.

Paul Groth gave a talk about the stellar success of the initial term of Open PHACTS.

  • Three releases of platform and data
  • 18 applications
  • Open PHACTS Foundation for sustainable exploitation and further development of the platform
  • Superb culture of collaboration
    • great team spirit
    • great output from distributed organization
    • lots of face-to-face time
    • example to every other big collaborative project

"The reincarnation of Steve Jobs," commented someone from the audience. "Except I am a nice guy," retorted Paul.

Commented one attendee, "The semantic web… I just was in Boston at a semantic web meeting – so nerdy, something to make you walk out of the room… so it is a definite victory for Open PHACTS, and why not also semantic web, that something based on these principles actually works."

It is a win anyhow, so I did not say anything at the meeting. So I will say something here, where I have more space, as the message bears repeating.

We share part of the perception, so we hardly ever say "semantic web." The word is "linked data," and it means flexible schema and global identifiers. Flexible schema means that everything does not have to be modeled upfront. Global identifiers means that data, when transferred out of its silo of origin, remains interpretable and self-describing, so you can mix it with other data without things getting confused. "Desiloization" is a wonderful new word for describing this.

This ties right into FAIRport and FAIR data: Findable, Accessible, Interoperable, Reusable. Barend Mons talked a lot about this: open just means downloadable; FAIR means something you can do science with. Barend’s take is that RDF with a URI for everything is the super wire format for exchanging data. When you process it, you will diversely cook it, so an RDF store is one destination but not the only possibility. It has been said before: there is a range of choices between storing triples verbatim and making application-specific extractions, including ones with a schema, whether graph DB or relational.

Nanopublications are also moving ahead. Christine Chichester told me about pending publications involving Open PHACTS nanopublications about post-translation modification of proteins and their expression in different tissues. So there are nanopublications out there and they can be joined, just as intended. Victory of e-science and data integration.

The Open PHACTS project is now officially extended for another two-year term, bringing the total duration to five years. The Open PHACTS Foundation exists as a legal entity and has its first members. This is meant to be a non-profit industry association for sharing pre-competitive data, and the services around it, between players in the pharma space, in industry as well as academia. There are press releases to follow in due time.

I am looking forward to more Open PHACTS. From the OpenLink and Virtuoso side, there are directly relevant developments that will enter production in the next few months, including query caching discussed earlier on this blog, as well as running on the TPC-H tuned analytics branch for overall better query optimization. Adaptive schema is something of evident value to Open PHACTS, as much of the integrated data comes from relational sources, so is regular enough. Therefore taking advantage of this for storage cannot hurt. We will see this still within the scope of the project extension.

Otherwise, more cooperation in formulating the queries for the business questions will also help.

All in all, Open PHACTS is the celebrated beauty queen of the whole Innovative Medicines Initiative, it would seem. Superbly connected, unparalleled logo cloud, actually working and useful data integration, delivering on time on what are in fact very complex business questions.

03/31/2014 11:49 GMT-0500
In Hoc Signo Vinces (part 10 of n): TPC-H Q9, Q17, Q20 - Predicate Games

TPC-H is a hash join game. The rules do allow indices, but maintaining these takes time, and indices will quickly result in non-local access patterns. Indices also take space. Besides, somebody must know what indices to create, which is not obvious. Thus, it is best if a BI data warehouse works without them.

Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So, the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense since only green ones are being looked for.

So, let's consider Q9. The query is:

SELECT                 nation,
                       o_year,
        SUM(amount) AS sum_profit
 FROM  ( SELECT
                                                                          n_name AS nation,
                                               EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
                 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
           FROM
                 part,
                 supplier,
                 lineitem,
                 partsupp,
                 orders,
                 nation
          WHERE  s_suppkey = l_suppkey
            AND  ps_suppkey = l_suppkey
            AND  ps_partkey = l_partkey
            AND  p_partkey = l_partkey
            AND  o_orderkey = l_orderkey
            AND  s_nationkey = n_nationkey
            AND  p_name like '%green%'
       ) AS profit
GROUP BY  nation,
          o_year
ORDER BY  nation,
          o_year DESC
;

The intent is to calculate profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest-running queries of the set. The query is restricted by part only, and the condition selects 1/17 of all parts.

The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect to do frequent lookups, thus making a hash is faster than using the index. Partsupp is the 3rd largest table in the database. This has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows where the part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey, but rather the system must understand that this follows from l_partkey = ps_partkey and l_partkey = p_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.

Looking further into the plan, we note a scan of lineitem followed by a hash join with part. Restricting the build of the partsupp hash would have the same effect, hence part is here used twice while it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with a 2 part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part first is faster since this becomes a single-key, invisible hash join, merged into the lineitem scan, done before even accessing the l_suppkey and other columns.
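
As a sketch of the effective build-side queries (not the literal internal form), the two part-restricted hash fillers in the profile below correspond roughly to the following derived tables; the hf numbers refer to the hash fillers in the profile.

-- build over partsupp, restricted to green parts (Sort hf 82):
SELECT  ps_suppkey,
        ps_partkey,
        ps_supplycost
  FROM  partsupp,
        part
 WHERE  ps_partkey = p_partkey
   AND  p_name LIKE '%green%'
;

-- single-key build used to pre-filter the lineitem scan (Sort hf 101):
SELECT  p_partkey
  FROM  part
 WHERE  p_name LIKE '%green%'
;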

{ 
time   3.9e-06% fanout         1 input         1 rows
time   4.7e-05% fanout         1 input         1 rows
{ hash filler
time   3.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, nation)
 
time   8.8e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
 
}
time      0.16% fanout         1 input         1 rows
{ hash filler
time     0.011% fanout     1e+06 input         1 rows
SUPPLIER     1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
 
time      0.03% fanout         0 input     1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
 
}
time      0.57% fanout         1 input         1 rows
{ hash filler
Subquery 58 
{ 
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(t1.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time       1.1% fanout         4 input 1.17076e+06 rows
PARTSUPP       3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
 inlined  PS_PARTKEY = t1.P_PARTKEY
 
After code:
      0: t4.PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: t4.PS_PARTKEY :=  := artm t4.PS_PARTKEY
      8: t1.P_PARTKEY :=  := artm t1.P_PARTKEY
      12: t4.PS_SUPPLYCOST :=  := artm t4.PS_SUPPLYCOST
      16: BReturn 0
time      0.33% fanout         0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
 
}
}
time      0.18% fanout         1 input         1 rows
{ hash filler
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time     0.017% fanout         0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time   5.1e-06% fanout         1 input         1 rows
{ fork
time   4.1e-06% fanout         1 input         1 rows
{ fork
time        59% fanout 3.51125e+07 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
 
hash partition+bloom by 108 (tmp)hash join merged always card     0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card         1 -> (.S_NATIONKEY)
time      0.18% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
Hash source 101 merged into ts      0.058 rows(.L_PARTKEY) -> ()
time        17% fanout         1 input 3.51125e+07 rows
Hash source 82       0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> (  <none> , .PS_SUPPLYCOST)
time       6.2% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
      4: temp := artm temp - temp
      8: BReturn 0
ORDERS unq         1 rows (.O_ORDERDATE)
 inlined  O_ORDERKEY = k_.L_ORDERKEY
time    0.0055% fanout         1 input 3.51125e+07 rows
Hash source 49 merged into ts          1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time       3.5% fanout         1 input 3.51125e+07 rows
Hash source 35           1 rows(k_.S_NATIONKEY) -> (nation)
time       8.8% fanout         0 input 3.51125e+07 rows
 
Precode:
      0: o_year := Call year (.O_ORDERDATE)
      5: BReturn 0
Sort (nation, o_year) -> (temp)
 
}
time   4.7e-05% fanout       175 input         1 rows
group by read node  
(nation, o_year, sum_profit)
time   0.00028% fanout         0 input       175 rows
Sort (nation, o_year) -> (sum_profit)
 
}
time   2.2e-05% fanout       175 input         1 rows
Key from temp (nation, o_year, sum_profit)
 
time   1.6e-06% fanout         0 input       175 rows
Select (nation, o_year, sum_profit)
}


 6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq   99.6068% same seg  0.357328% same pg 

6.1s is a good score for this query. When executing the same in 5 parallel invocations, the fastest ends in 13.7s and the slowest in 27.6s. For five concurrent executions, the peak transient memory utilization is 4.7 GB for the hash tables, which is very reasonable.

*           *           *           *           *

Let us next consider Q17.

SELECT
        SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
        lineitem,
        part
WHERE
        p_partkey = l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l_quantity 
           < (
                   SELECT
                            2e-1 * AVG(l_quantity)
                   FROM
                            lineitem
                   WHERE
                            l_partkey = p_partkey
                )

Deceptively simple? This calculates the total value of small lineitems (those with a quantity below 1/5 of the average quantity for the part) for all parts of a given brand with a specific container.

If there is an index on l_partkey, the plan is easy enough: Take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.

If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and this would take so long that the race would be lost on this point alone. The trick is to import the restriction on l_partkey into the hash build. This gives a plan that scans lineitem twice, each time with a very selective hash join (few parts qualify). For each lineitem of a qualifying part, there is then a lookup of the average quantity for the part, so the average is potentially calculated several times for the same part.
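
Concretely, the restricted hash build corresponds roughly to the derived table below (a sketch; in the actual plan the join with part is merged into the lineitem scan, as the hash filler subquery in the profile shows):

SELECT  l_partkey,
        l_quantity
  FROM  lineitem,
        part
 WHERE  l_partkey = p_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'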

The plan below is workable, but better is possible: we notice that the very selective join needs to be done only once; it is cheaper to remember its result than to compute it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as

SELECT
        ... 
  FROM
        lineitem, 
        part, 
        ( SELECT 
                                           l_partkey, 
                 0.2 * AVG (l_quantity) AS qty 
            FROM 
                 lineitem, 
                 part 
            ...
        ) f 
 WHERE
        l_partkey = f.l_partkey 
 ...

In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
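
Spelling out the elided parts, the rewritten Q17 could look roughly as follows. This is an illustrative form with table aliases added for disambiguation; the profile further below is for the plan as actually run, i.e., without this rewrite.

SELECT  SUM(l.l_extendedprice) / 7.0 AS avg_yearly
  FROM  lineitem l,
        part p,
        ( SELECT  l_partkey,
                  0.2 * AVG(l_quantity) AS qty
            FROM  lineitem,
                  part
           WHERE  l_partkey = p_partkey
             AND  p_brand = 'Brand#23'
             AND  p_container = 'MED BOX'
        GROUP BY  l_partkey
        ) f
 WHERE  p.p_partkey = l.l_partkey
   AND  p.p_brand = 'Brand#23'
   AND  p.p_container = 'MED BOX'
   AND  l.l_partkey = f.l_partkey
   AND  l.l_quantity < f.qty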

{ 
time   7.9e-06% fanout         1 input         1 rows
time    0.0031% fanout         1 input         1 rows
{ hash filler
time      0.27% fanout     20031 input         1 rows
PART     2e+04 rows(.P_PARTKEY)
 P_BRAND =  <c Brand#23> ,  P_CONTAINER =  <c MED BOX>
time   0.00047% fanout         0 input     20031 rows
Sort hf 34 (.P_PARTKEY)
}
time       0.1% fanout         1 input         1 rows
{ hash filler
Subquery 40 
{ 
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
 
hash partition+bloom by 38 (tmp)hash join merged always card     0.001 -> ()
time    0.0042% fanout         1 input    600982 rows
Hash source 34 merged into ts not partitionable     0.001 rows(t4.L_PARTKEY) -> ()
 
After code:
      0: t4.L_PARTKEY :=  := artm t4.L_PARTKEY
      4: t4.L_QUANTITY :=  := artm t4.L_QUANTITY
      8: BReturn 0
time     0.059% fanout         0 input    600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
 
}
}
time   6.8e-05% fanout         1 input         1 rows
{ fork
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
 
hash partition+bloom by 38 (tmp)hash join merged always card   0.00052 -> ()
time   0.00021% fanout         1 input    600982 rows
Hash source 34 merged into ts    0.00052 rows(.L_PARTKEY) -> ()
 
Precode:
      0: .P_PARTKEY :=  := artm .L_PARTKEY
      4: BReturn 0
END Node
After test:
      0: { 
time     0.038% fanout         1 input    600982 rows
time      0.17% fanout         1 input    600982 rows
{ fork
time       6.8% fanout         0 input    600982 rows
Hash source 62  not partitionable      0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
 
After code:
      0:  sum sum.L_QUANTITYset no set_ctr
      5:  sum count 1 set no set_ctr
      10: BReturn 0
}
 
After code:
      0: temp := artm sum / count
      4: temp := artm  0.2  * temp
      8: aggregate :=  := artm temp
      12: BReturn 0
time     0.042% fanout         0 input    600982 rows
Subquery Select(aggregate)
}
 
      8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
 
After code:
      0:  sum sum.L_EXTENDEDPRICE
      5: BReturn 0
}
 
After code:
      0: avg_yearly := artm sum /  7 
      4: BReturn 0
time   4.6e-06% fanout         0 input         1 rows
Select (avg_yearly)
}


 2695 msec 1996% cpu,         3 rnd 1.18242e+09 seq         0% same seg         0% same pg 

2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.

*           *           *           *           *

Q20 is not very long-running, but it is maybe the hardest to optimize of the lot. But as usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.

SELECT TOP 100
           s_name,
           s_address
     FROM
           supplier,
           nation
    WHERE
           s_suppkey IN 
             ( SELECT  
                       ps_suppkey
                 FROM  
                       partsupp
                WHERE  
                       ps_partkey IN 
                         ( SELECT  
                                   p_partkey
                             FROM  
                                   part
                            WHERE  
                                   p_name LIKE 'forest%'
                         )
                  AND  ps_availqty > 
                         ( SELECT  
                                   0.5 * SUM(l_quantity)
                             FROM  
                                   lineitem
                            WHERE  
                                   l_partkey = ps_partkey
                              AND  l_suppkey = ps_suppkey
                              AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                              AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                         )
             )
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name

This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.

The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS which in turn can become a join as follows:

SELECT 
        l_suppkey 
FROM
        lineitem 
WHERE
        l_partkey IN  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_name LIKE 'forest%'
       )
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem 
 WHERE  EXISTS  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_partkey = l_partkey 
               AND  p_name LIKE 'forest%')
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        ( SELECT  
                    DISTINCT p_partkey 
            FROM  
                    part 
           WHERE  
                    p_name LIKE 'forest%') f 
 WHERE  
        l_partkey = f.p_partkey
;

But since p_partkey is unique, the DISTINCT drops off, and we have:

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        part 
 WHERE  
        p_name LIKE 'forest%' 
   AND  l_partkey = p_partkey
;

You see, the innermost IN with the ps_partkey goes through all these changes, and just becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.
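
Putting the pieces together, Q20 after these transformations has roughly the following shape. The derived-table name f is introduced here for illustration; the optimizer arrives at an equivalent form internally rather than by textual rewrite.

SELECT TOP 100
           s_name,
           s_address
     FROM  supplier,
           nation,
           ( SELECT DISTINCT
                     ps_suppkey
               FROM  partsupp,
                     part
              WHERE  ps_partkey = p_partkey
                AND  p_name LIKE 'forest%'
                AND  ps_availqty >
                       ( SELECT  0.5 * SUM(l_quantity)
                           FROM  lineitem
                          WHERE  l_partkey = ps_partkey
                            AND  l_suppkey = ps_suppkey
                            AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                            AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                       )
           ) f
    WHERE  s_suppkey = f.ps_suppkey
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name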

The derived table is flattened and the DISTINCT is done partitioned; hence the stage node in front of the distinct. A DISTINCT can be multithreaded, if each thread gets a specific subset of all the keys. The stage node is an exchange of tuples between several threads. Each thread then does a TOP k sort. The TOP k trick we saw in Q18 is used, but does not contribute much here.

{ 
time   8.2e-06% fanout         1 input         1 rows
time   0.00017% fanout         1 input         1 rows
{ hash filler
time   6.1e-05% fanout         1 input         1 rows
NATION         1 rows(.N_NATIONKEY)
 N_NAME =  <c CANADA>
time   1.2e-05% fanout         0 input         1 rows
Sort hf 34 (.N_NATIONKEY)
}
time     0.073% fanout         1 input         1 rows
{ hash filler
time       4.1% fanout    240672 input         1 rows
PART   2.4e+05 rows(t74.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time     0.011% fanout         0 input    240672 rows
Sort hf 47 (t74.P_PARTKEY)
}
time      0.69% fanout         1 input         1 rows
{ hash filler
Subquery 56 
{ 
time        42% fanout 1.09657e+06 input         1 rows
LINEITEM   9.1e+07 rows(t76.L_PARTKEY, t76.L_SUPPKEY, t76.L_QUANTITY)
 L_SHIPDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 54 (tmp)hash join merged always card     0.012 -> ()
time     0.022% fanout         1 input 1.09657e+06 rows
Hash source 47 merged into ts not partitionable     0.012 rows(t76.L_PARTKEY) -> ()
 
After code:
      0: t76.L_PARTKEY :=  := artm t76.L_PARTKEY
      4: t76.L_SUPPKEY :=  := artm t76.L_SUPPKEY
      8: t76.L_QUANTITY :=  := artm t76.L_QUANTITY
      12: BReturn 0
time      0.22% fanout         0 input 1.09657e+06 rows
Sort hf 80 (t76.L_PARTKEY, t76.L_SUPPKEY) -> (t76.L_QUANTITY)
 
}
}
time   2.1e-05% fanout         1 input         1 rows
time   3.2e-05% fanout         1 input         1 rows
{ fork
time       5.3% fanout    240672 input         1 rows
PART   2.4e+05 rows(t6.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time       1.9% fanout         4 input    240672 rows
PARTSUPP       1.2 rows(t4.PS_AVAILQTY, t4.PS_PARTKEY, t4.PS_SUPPKEY)
 inlined  PS_PARTKEY = t6.P_PARTKEY
time        16% fanout  0.680447 input    962688 rows
END Node
After test:
      0: { 
time      0.08% fanout         1 input    962688 rows
time       9.4% fanout         1 input    962688 rows
{ fork
time       3.6% fanout         0 input    962688 rows
Hash source 80       0.013 rows(k_t4.PS_PARTKEY, k_t4.PS_SUPPKEY) -> (t8.L_QUANTITY)
 
After code:
      0:  sum sumt8.L_QUANTITYset no set_ctr
      5: BReturn 0
}
 
After code:
      0: temp := artm  0.5  * sum
      4: aggregate :=  := artm temp
      8: BReturn 0
time      0.85% fanout         0 input    962688 rows
Subquery Select(aggregate)
}
 
      8: if (t4.PS_AVAILQTY > scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
time         1% fanout         1 input    655058 rows
Stage 2
time     0.071% fanout         1 input    655058 rows
Distinct (q_t4.PS_SUPPKEY)
 
After code:
      0: PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: BReturn 0
time     0.016% fanout         1 input    655058 rows
Subquery Select(PS_SUPPKEY)
time       3.2% fanout 0.0112845 input    655058 rows
SUPPLIER unq     0.075 rows (.S_NAME, .S_NATIONKEY, .S_ADDRESS)
 inlined  S_SUPPKEY = PS_SUPPKEY
hash partition+bloom by 38 (tmp)hash join merged always card      0.04 -> ()
top k on S_NAME
time    0.0012% fanout         1 input      7392 rows
Hash source 34 merged into ts       0.04 rows(.S_NATIONKEY) -> ()
time     0.074% fanout         0 input      7392 rows
Sort (.S_NAME) -> (.S_ADDRESS)
 
}
time   0.00013% fanout       100 input         1 rows
top order by read (.S_NAME, .S_ADDRESS)
time     5e-06% fanout         0 input       100 rows
Select (.S_NAME, .S_ADDRESS)
}


 1777 msec 1355% cpu,    894483 rnd 6.39422e+08 seq   79.1214% same seg   19.3093% same pg 

1.8s is sufficient, and in the ballpark with VectorWise. Some further gain is possible, as the lineitem hash table can also be restricted by supplier; after all, only 1/25 of all suppliers are in the end considered. Further simplifications are possible. Another 20% of time could be saved. The tricks are however quite complex and specific, and there are easier gains to be had -- for example, in reusing intermediates in Q17 and Q15.
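
The further restriction mentioned above would amount to building the lineitem hash from something like the query below (a sketch; whether this pays off depends on the cost of the extra joins on the build side):

SELECT  l_partkey,
        l_suppkey,
        l_quantity
  FROM  lineitem,
        part,
        supplier,
        nation
 WHERE  l_partkey = p_partkey
   AND  p_name LIKE 'forest%'
   AND  l_suppkey = s_suppkey
   AND  s_nationkey = n_nationkey
   AND  n_name = 'CANADA'
   AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
   AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))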

The next installment will discuss late projection and some miscellaneous tricks not mentioned so far. After this, we are ready to take an initial look at the performance of the system as a whole.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

03/20/2014 16:03 GMT-0500 Modified: 05/28/2014 17:13 GMT-0500
Linked Geospatial Data 2014 Workshop, Part 4: GeoKnow, London, Brussels, The Message

Last Friday (2014-03-14) I gave a talk about GeoKnow at the EC Copernicus Big Data workshop. This was a trial run for more streamlined messaging. I have, aside from the practice of geekcraft, occupied myself with questions of communication these last weeks.

The clear take-home from London and Brussels alike is that these events have full days and 4 or more talks an hour. It is not quite TV commercial spots yet but it is going in this direction.

If you say something complex, little will get across unless the audience already knows what you will be saying.

I had a set of slides from Jens Lehmann, the GeoKnow project coordinator, for whom I was standing in. Now these are a fine rendition of the description of work. What is wrong with partners, work packages, objectives, etc? Nothing, except everybody has them.

I recall the old story about the journalist and the Zen master: The Zen master repeatedly advises the reporter to cut the story in half. We get the same from PR professionals, "If it is short, they have at least thought about what should go in there," said one recently, talking of pitches and messages. The other advice was to use pictures. And to have a personal dimension to it.

Enter "Ms. Globe" and "Mr. Cube". Frans Knibbe of Geodan gave the Linked Geospatial Data 2014 workshop's most memorable talk entitled "Linked Data and Geoinformatics - a love story" (pdf) about the excitement and the pitfalls of the burgeoning courtship of Ms. Globe (geoinformatics) and Mr. Cube (semantic technology). They get to talking, later Ms. Globe thinks to herself... "Desiloisazation, explicit semantics, integrated metadata..." Mr. Cube, young upstart now approaching a more experienced and sophisticated lady, dreams of finally making an entry into adult society, "critical mass, global scope, relevant applications..." There is a vibration in the air.

So, with Frans Knibbe's gracious permission I borrowed the storyline and some of the pictures.

We ought to make a series of cartoons about the couple. There will be twists and turns in the story to come.

Mr. Cube is not Ms. Globe's first lover, though; there is also rich and worldly Mr. Table. How will Mr. Cube prove himself? The eternal question... Well, not by moping around, not by wise-cracking about semantics, no. By boldly setting out upon a journey to fetch the Golden Fleece from beyond the crashing rocks. "Column store, vectored execution, scale out, data clustering, adaptive schema..." he affirms, with growing confidence.

This is where the story stands, right now. Virtuoso runs circles around PostGIS doing aggregations and lookups on geometries in a map-scrolling scenario (GeoKnow's GeoBenchLab). Virtuoso SPARQL outperforms PostGIS SQL against planet-scale OpenStreetMap; Virtuoso SQL goes 5-10x faster still.

Mr. Cube is fast on the draw, but some corners can still be smoothed out.

Later in GeoKnow, there will be still more speed, but also near parity between SQL and SPARQL, by taking advantage of data regularity in guiding physical storage. If it is big, it is bound to have repeating structure.

The love story grows more real by the day. To be consummated still within GeoKnow.

Talking of databases has the great advantage that this has been a performance game from the start. There are few people who need convincing about the desirability of performance, as this also makes for lower cost and more flexibility on the application side.

But this is not all there is to it.

In Brussels, the audience came from E-science (Earth observation). In science, it is understood that qualitative aspects can be even more crucial. I told the story about an E-science-oriented workshop I attended in America years ago. The practitioners, from high energy physics to life sciences to climate, had invariably come across the need for self-description of data and for schema-last. This was essentially never provided by RDF, except in some life science cases. Rather, there were one-off schemes, ranging from key-value pairs to putting the table name in a column of the same table to preserve the origin across data export.

Explicit semantics and integrated metadata are important, Ms. Globe knows, but she cannot sacrifice operational capacity for this. So it is more than a DBMS or even data model choice -- there must be a solid tool chain for data integration and visualization. GeoKnow provides many tools in this space.

Some of these, such as the LIMES entity matching framework (pdf) are probably close to the best there is. For other parts, the SQL-based products with hundreds of person years invested in user interaction are simply unbeatable.

In these cases, the world can continue to talk SQL. If the regular part of the data is in fact tables already, so much the better. You connect to Virtuoso via SQL, just like to PostGIS or Oracle Spatial, and talk SQL MM. The triples, in the sense of flexible annotation and integrated metadata, stay there; you just do not see them if you do not want them.

There are possibilities all right. In the coming months I will showcase some of the progress, starting with a detailed look at the OpenStreetMap experiments we have made in GeoKnow.

Linked Geospatial Data 2014 Workshop posts:

03/18/2014 10:45 GMT-0500 Modified: 03/18/2014 10:52 GMT-0500