Details
OpenLink Software
Burlington, United States
Subscribe
Post Categories
Recent Articles
Community Member Blogs
Display Settings
Translate
|
New Semantic Publishing Benchmark Record
[
Virtuoso Data Space Bot
]
There is a new SPB (Semantic Publishing Benchmark) 256 Mtriple record with Virtuoso.
As before, the result has been measured with the feature/analytics branch of the v7fasttrack open source distribution, and it will soon be available as a preconfigured Amazon EC2 image. The updated benchmarks AMI with this version of the software will be out there within the next week, to be announced on this blog.
On the Cost of RDF Query Optimization
RDF query optimization is harder than the relational equivalent; first, because there are more joins, hence an NP complete explosion of plan search space, and second, because cardinality estimation is harder and usually less reliable. The work on characteristic sets, pioneered by Thomas Neumann in RDF3X, uses regularities in structure for treating properties usually occurring in the same subject as columns of a table. The same idea is applied for tuning physical representation in the joint Virtuoso / MonetDB work published at WWW 2015.
The Virtuoso results discussed here, however, are all based on a single RDF quad table with Virtuoso's default index configuration.
Introducing query plan caching raises the Virtuoso score from 80 qps to 144 qps at the 256 Mtriple scale. The SPB queries are not extremely complex; lookups with many more triple patterns exist in actual workloads, e.g., Open PHACTS. In such applications, query optimization indeed dominates execution times. In SPB, data volumes touched by queries grow near linearly with data scale. At the 256 Mtriple scale, nearly half of CPU cycles are spent deciding a query plan. Below are the CPU cycles for execution and compilation per query type, sorted by descending sum of the times, scaled to milliseconds per execution. These are taken from a one minute sample of running at full throughput.
Test system is the same used before in the TPC-H series: dual Xeon E5-2630 Sandy Bridge, 2 x 6 cores x 2 threads, 2.3GHz, 192 GB RAM.
We measure the compile and execute times, with and without using hash join. When considering hash join, the throughput is 80 qps. When not considering hash join, the throughput is 110 qps. With query plan caching, the throughput is 145 qps whether or not hash join is considered. Using hash join is not significant for the workload but considering its use in query optimization leads to significant extra work.
With hash join
Compile |
Execute |
Total |
Query |
3156 ms |
1181 ms |
4337 ms |
Total |
1327 ms |
28 ms |
1355 ms |
query 01 |
444 ms |
460 ms |
904 ms |
query 08 |
466 ms |
54 ms |
520 ms |
query 06 |
123 ms |
268 ms |
391 ms |
query 05 |
257 ms |
5 ms |
262 ms |
query 11 |
191 ms |
59 ms |
250 ms |
query 10 |
9 ms |
179 ms |
188 ms |
query 04 |
114 ms |
26 ms |
140 ms |
query 07 |
46 ms |
62 ms |
108 ms |
query 09 |
71 ms |
25 ms |
96 ms |
query 12 |
61 ms |
13 ms |
74 ms |
query 03 |
47 ms |
2 ms |
49 ms |
query 02 |
|
|
Without hash join
Compile |
Execute |
Total |
Query |
1816 ms |
1019 ms |
2835 ms |
Total |
197 ms |
466 ms |
663 ms |
query 08 |
609 ms |
32 ms |
641 ms |
query 01 |
188 ms |
293 ms |
481 ms |
query 05 |
275 ms |
61 ms |
336 ms |
query 09 |
163 ms |
10 ms |
173 ms |
query 03 |
128 ms |
38 ms |
166 ms |
query 10 |
102 ms |
5 ms |
107 ms |
query 11 |
63 ms |
27 ms |
90 ms |
query 12 |
24 ms |
57 ms |
81 ms |
query 06 |
47 ms |
1 ms |
48 ms |
query 02 |
15 ms |
24 ms |
39 ms |
query 07 |
5 ms |
5 ms |
10 ms |
query 04 |
|
Considering hash join always slows down compilation, and sometimes improves and sometimes worsens execution. Some improvement in cost-model and plan-space traversal-order is possible, but altogether removing compilation via caching is better still. The results are as expected, since a lookup workload such as SPB has little use for hash join by nature.
The rationale for considering hash join in the first place is that analytical workloads rely heavily on this. A good TPC-H score is simply unfeasible without this as previously discussed on this blog. If RDF is to be a serious contender beyond serving lookups, then hash join is indispensable. The decision for using this however depends on accurate cardinality estimates on either side of the join.
Previous work (e.g., papers from FORTH around MonetDB) advocates doing away with a cost model altogether, since one is hard and unreliable with RDF anyway. The idea is not without its attraction but will lead to missing out of analytics or to relying on query hints for hash join.
The present Virtuoso thinking is that going to rule based optimization is not the preferred solution, but rather using characteristic sets for reducing triples into wider tables, which also cuts down on plan search space and increases reliability of cost estimation.
When looking at execution alone, we see that actual database operations are low in the profile, with memory management taking the top 19%. This is due to CONSTRUCT queries allocating small blocks for returning graphs, which is entirely avoidable.
01/11/2016 15:22 GMT
|
Modified:
01/11/2016 15:25 GMT
|
New Semantic Publishing Benchmark Record
[
Orri Erling
]
There is a new SPB (Semantic Publishing Benchmark) 256 Mtriple record with Virtuoso.
As before, the result has been measured with the feature/analytics branch of the v7fasttrack open source distribution, and it will soon be available as a preconfigured Amazon EC2 image. The updated benchmarks AMI with this version of the software will be out there within the next week, to be announced on this blog.
On the Cost of RDF Query Optimization
RDF query optimization is harder than the relational equivalent; first, because there are more joins, hence an NP complete explosion of plan search space, and second, because cardinality estimation is harder and usually less reliable. The work on characteristic sets, pioneered by Thomas Neumann in RDF3X, uses regularities in structure for treating properties usually occurring in the same subject as columns of a table. The same idea is applied for tuning physical representation in the joint Virtuoso / MonetDB work published at WWW 2015.
The Virtuoso results discussed here, however, are all based on a single RDF quad table with Virtuoso's default index configuration.
Introducing query plan caching raises the Virtuoso score from 80 qps to 144 qps at the 256 Mtriple scale. The SPB queries are not extremely complex; lookups with many more triple patterns exist in actual workloads, e.g., Open PHACTS. In such applications, query optimization indeed dominates execution times. In SPB, data volumes touched by queries grow near linearly with data scale. At the 256 Mtriple scale, nearly half of CPU cycles are spent deciding a query plan. Below are the CPU cycles for execution and compilation per query type, sorted by descending sum of the times, scaled to milliseconds per execution. These are taken from a one minute sample of running at full throughput.
Test system is the same used before in the TPC-H series: dual Xeon E5-2630 Sandy Bridge, 2 x 6 cores x 2 threads, 2.3GHz, 192 GB RAM.
We measure the compile and execute times, with and without using hash join. When considering hash join, the throughput is 80 qps. When not considering hash join, the throughput is 110 qps. With query plan caching, the throughput is 145 qps whether or not hash join is considered. Using hash join is not significant for the workload but considering its use in query optimization leads to significant extra work.
With hash join
Compile |
Execute |
Total |
Query |
3156 ms |
1181 ms |
4337 ms |
Total |
1327 ms |
28 ms |
1355 ms |
query 01 |
444 ms |
460 ms |
904 ms |
query 08 |
466 ms |
54 ms |
520 ms |
query 06 |
123 ms |
268 ms |
391 ms |
query 05 |
257 ms |
5 ms |
262 ms |
query 11 |
191 ms |
59 ms |
250 ms |
query 10 |
9 ms |
179 ms |
188 ms |
query 04 |
114 ms |
26 ms |
140 ms |
query 07 |
46 ms |
62 ms |
108 ms |
query 09 |
71 ms |
25 ms |
96 ms |
query 12 |
61 ms |
13 ms |
74 ms |
query 03 |
47 ms |
2 ms |
49 ms |
query 02 |
|
|
Without hash join
Compile |
Execute |
Total |
Query |
1816 ms |
1019 ms |
2835 ms |
Total |
197 ms |
466 ms |
663 ms |
query 08 |
609 ms |
32 ms |
641 ms |
query 01 |
188 ms |
293 ms |
481 ms |
query 05 |
275 ms |
61 ms |
336 ms |
query 09 |
163 ms |
10 ms |
173 ms |
query 03 |
128 ms |
38 ms |
166 ms |
query 10 |
102 ms |
5 ms |
107 ms |
query 11 |
63 ms |
27 ms |
90 ms |
query 12 |
24 ms |
57 ms |
81 ms |
query 06 |
47 ms |
1 ms |
48 ms |
query 02 |
15 ms |
24 ms |
39 ms |
query 07 |
5 ms |
5 ms |
10 ms |
query 04 |
|
Considering hash join always slows down compilation, and sometimes improves and sometimes worsens execution. Some improvement in cost-model and plan-space traversal-order is possible, but altogether removing compilation via caching is better still. The results are as expected, since a lookup workload such as SPB has little use for hash join by nature.
The rationale for considering hash join in the first place is that analytical workloads rely heavily on this. A good TPC-H score is simply unfeasible without this as previously discussed on this blog. If RDF is to be a serious contender beyond serving lookups, then hash join is indispensable. The decision for using this however depends on accurate cardinality estimates on either side of the join.
Previous work (e.g., papers from FORTH around MonetDB) advocates doing away with a cost model altogether, since one is hard and unreliable with RDF anyway. The idea is not without its attraction but will lead to missing out of analytics or to relying on query hints for hash join.
The present Virtuoso thinking is that going to rule based optimization is not the preferred solution, but rather using characteristic sets for reducing triples into wider tables, which also cuts down on plan search space and increases reliability of cost estimation.
When looking at execution alone, we see that actual database operations are low in the profile, with memory management taking the top 19%. This is due to CONSTRUCT queries allocating small blocks for returning graphs, which is entirely avoidable.
01/11/2016 15:22 GMT
|
Modified:
01/11/2016 15:25 GMT
|
DBpedia Usage Report, August 2015
[
Virtuoso Data Space Bot
]
We recently published the latest DBpedia Usage Report, covering v3.3 (released July, 2009) to v3.10 (sometimes called "DBpedia 2014"; released September, 2014).
The new report has usage data through July 31, 2015, and brought a few surprises to our eyes. What do you think?
DBpedia Usage Report, August 2015
[
Orri Erling
]
We recently published the latest DBpedia Usage Report, covering v3.3 (released July, 2009) to v3.10 (sometimes called "DBpedia 2014"; released September, 2014).
The new report has usage data through July 31, 2015, and brought a few surprises to our eyes. What do you think?
Big Data, Part 2: Virtuoso Meets Impala
[
Virtuoso Data Space Bot
]
In this article we will look at Virtuoso vs. Impala with 100G TPC-H on two R3.8 EC2 instances. We get a single user win for Virtuoso by a factor of 136, and a five user win by a factor of 55. The details and analysis follow.
The load setup is the same as ever, with copying from CSV files attached as external tables into Parquet tables. We get lineitem split over 88 Parquet files, which should provide enough parallelism for the platform. The Impala documentation states that there can be up to one thread per file, and here we wish to see maximum parallelism for a single query stream. We use the schema from the Impala github checkout, with string for string and date columns, and decimal for numbers . We suppose the authors know what works best.
The execution behavior is surprising. Sometimes we get full platform utilization, but quite often only 200% CPU per box. The query plan for Q1, for example, says 2 cores per box. This makes no sense, as the same plan fully well knows the table cardinality. The settings for scanner threads and cores to use (in impala-shell ) can be changed, but the behavior does not seem to change.
Following are the run times for one query stream.
Query |
Virtuoso |
Impala |
Notes |
— |
332 s |
841 s |
Data Load |
Q1 |
1.098 s |
164.61 s |
|
Q2 |
0.187 s |
24.19 s |
|
Q3 |
0.761 s |
105.70 s |
|
Q4 |
0.205 s |
179.67 s |
|
Q5 |
0.808 s |
84.51 s |
|
Q6 |
2.403 s |
4.43 s |
|
Q7 |
0.59 s |
270.88 s |
|
Q8 |
0.775 s |
51.89 s |
|
Q9 |
1.836 s |
177.72 s |
|
Q10 |
3.165 s |
39.85 s |
|
Q11 |
1.37 s |
22.56 s |
|
Q12 |
0.356 s |
17.03 s |
|
Q13 |
2.233 s |
103.67 s |
|
Q14 |
0.488 s |
10.86 s |
|
Q15 |
0.72 s |
11.49 s |
|
Q16 |
0.814 s |
23.93 s |
|
Q17 |
0.681 s |
276.06 s |
|
Q18 |
1.324 s |
267.13 s |
|
Q19 |
0.417 s |
368.80 s |
|
Q20 |
0.792 s |
60.45 s |
|
Q21 |
0.720 s |
418.09 s |
|
Q22 |
0.155 s |
40.59 s |
|
Total |
20 s |
2724 s |
|
Because the platform utilization was often low, we made a second experiment running the same queries in five parallel sessions. We show the average execution time for each query. We then compare this with the Virtuoso throughput run average times. We permute the single query stream used in the first tests in 5 different orders, as per the TPC-H spec. The results are not entirely comparable, because Virtuoso is doing the refreshes in parallel. According to Impala documentation, there is no random delete operation, so the refreshes cannot be implemented.
Just to establish a baseline, we do SELECT COUNT (*) FROM lineitem . This takes 20s when run by itself. When run in five parallel sessions, the fastest terminates in 64s and the slowest in 69s. Looking at top , the platform utilization is indeed about 5x more in CPU%, but the concurrency does not add much to throughput. This is odd, considering that there is no synchronization requirement worth mentioning between the operations.
Following are the average times for each query in the 5 stream experiment.
Query |
Virtuoso |
Impala |
Notes |
Q1 |
1.95 s |
191.81 s |
|
Q2 |
0.70 s |
40.40 s |
|
Q3 |
2.01 s |
95.67 s |
|
Q4 |
0.71 s |
345.11 s |
|
Q5 |
2.93 s |
112.29 s |
|
Q6 |
4.76 s |
14.41 s |
|
Q7 |
2.08 s |
329.25 s |
|
Q8 |
3.00 s |
98.91 s |
|
Q9 |
5.58 s |
250.88 s |
|
Q10 |
8.23 s |
55.23 s |
|
Q11 |
4.26 s |
27.84 s |
|
Q12 |
1.74 s |
37.66 s |
|
Q13 |
6.07 s |
147.69 s |
|
Q14 |
1.73 s |
23.91 s |
|
Q15 |
2.27 s |
23.79 s |
|
Q16 |
2.41 s |
34.76 s |
|
Q17 |
3.92 s |
362.43 s |
|
Q18 |
3.02 s |
348.08 s |
|
Q19 |
2.27 s |
443.94 s |
|
Q20 |
3.05 s |
92.50 s |
|
Q21 |
2.00 s |
623.69 s |
|
Q22 |
0.37 s |
61.36 s |
|
Total for Slowest Stream |
67 s |
3740 s |
|
There are 4 queries in Impala that terminated with an error (memory limit exceeded ). These were two Q21s, one Q19, one Q4. One stream executed without errors, so this stream is reported as the slowest stream. Q21 will, in the absence of indexed access, do a hash build side of half of lineitem , which explains running out of memory. Virtuoso does Q21 mostly by index.
Looking at the 5 streams, we see CPU between 1000% and 2000% on either box. This looks about 5x more than the 250% per box that we were seeing with, for instance, Q1. The process sizes for impalad are over 160G, certainly enough to have the working set in memory. iostat also does not show any I , so we seem to be running from memory, as intended.
We observe that Impala does not store tables in any specific order. Therefore a merge join of orders and lineitem is not possible. Thus we always get a hash join with a potentially large build side, e.g., half of orders and half of lineitem in Q21, and all orders in Q9. This explains in part why these take so long. TPC-DS does not pose this particular problem though, as there are no tables in the DS schema where the primary key of one would be the prefix of that of another.
However, the lineitem/orders join does not explain the scores on Q1, Q20, or Q19. A simple hash join of lineitem and part was about 90s, with a replicated part hash table. In the profile, the hash probe was 74s, which seems excessive. One would have to single-step through the hash probe to find out what actually happens. Maybe there are prohibitive numbers of collisions, which would throw off the results across the board. We would have to ask the Impala community about this.
Anyway, Impala experts out there are invited to set the record straight. We have attached the results and the output of the Impala profile statement for each query for the single stream run. impala_stream0.zip contains the evidence for the single-stream run; impala-stream1-5.zip holds the 5-stream run.
To be more Big Data-like, we should probably run with significantly larger data than memory; for example, 3T in 0.5T RAM. At EC2, we could do this with 2 I3.8 instances (6.4T SSD each). With Virtuoso, we'd be done in 8 hours or so, counting 2x for the I/O and 30x for the greater scale (the 100G experiment goes in 8 minutes or so, all included). With Impala, we could be running for weeks, so at the very least we'd like to do this with an Impala expert, to make sure things are done right and will not have to be retried. Some of the hash joins would have to be done in multiple passes and with partitioning.
In subsequent articles, we will look at other players in this space, and possibly some other benchmarks, like the TPC-DS subset that Actian uses to beat Impala.
Big Data, Part 2: Virtuoso Meets Impala
[
Orri Erling
]
In this article we will look at Virtuoso vs. Impala with 100G TPC-H on two R3.8 EC2 instances. We get a single user win for Virtuoso by a factor of 136, and a five user win by a factor of 55. The details and analysis follow.
The load setup is the same as ever, with copying from CSV files attached as external tables into Parquet tables. We get lineitem split over 88 Parquet files, which should provide enough parallelism for the platform. The Impala documentation states that there can be up to one thread per file, and here we wish to see maximum parallelism for a single query stream. We use the schema from the Impala github checkout, with string for string and date columns, and decimal for numbers . We suppose the authors know what works best.
The execution behavior is surprising. Sometimes we get full platform utilization, but quite often only 200% CPU per box. The query plan for Q1, for example, says 2 cores per box. This makes no sense, as the same plan fully well knows the table cardinality. The settings for scanner threads and cores to use (in impala-shell ) can be changed, but the behavior does not seem to change.
Following are the run times for one query stream.
Query |
Virtuoso |
Impala |
Notes |
— |
332 s |
841 s |
Data Load |
Q1 |
1.098 s |
164.61 s |
|
Q2 |
0.187 s |
24.19 s |
|
Q3 |
0.761 s |
105.70 s |
|
Q4 |
0.205 s |
179.67 s |
|
Q5 |
0.808 s |
84.51 s |
|
Q6 |
2.403 s |
4.43 s |
|
Q7 |
0.59 s |
270.88 s |
|
Q8 |
0.775 s |
51.89 s |
|
Q9 |
1.836 s |
177.72 s |
|
Q10 |
3.165 s |
39.85 s |
|
Q11 |
1.37 s |
22.56 s |
|
Q12 |
0.356 s |
17.03 s |
|
Q13 |
2.233 s |
103.67 s |
|
Q14 |
0.488 s |
10.86 s |
|
Q15 |
0.72 s |
11.49 s |
|
Q16 |
0.814 s |
23.93 s |
|
Q17 |
0.681 s |
276.06 s |
|
Q18 |
1.324 s |
267.13 s |
|
Q19 |
0.417 s |
368.80 s |
|
Q20 |
0.792 s |
60.45 s |
|
Q21 |
0.720 s |
418.09 s |
|
Q22 |
0.155 s |
40.59 s |
|
Total |
20 s |
2724 s |
|
Because the platform utilization was often low, we made a second experiment running the same queries in five parallel sessions. We show the average execution time for each query. We then compare this with the Virtuoso throughput run average times. We permute the single query stream used in the first tests in 5 different orders, as per the TPC-H spec. The results are not entirely comparable, because Virtuoso is doing the refreshes in parallel. According to Impala documentation, there is no random delete operation, so the refreshes cannot be implemented.
Just to establish a baseline, we do SELECT COUNT (*) FROM lineitem . This takes 20s when run by itself. When run in five parallel sessions, the fastest terminates in 64s and the slowest in 69s. Looking at top , the platform utilization is indeed about 5x more in CPU%, but the concurrency does not add much to throughput. This is odd, considering that there is no synchronization requirement worth mentioning between the operations.
Following are the average times for each query in the 5 stream experiment.
Query |
Virtuoso |
Impala |
Notes |
Q1 |
1.95 s |
191.81 s |
|
Q2 |
0.70 s |
40.40 s |
|
Q3 |
2.01 s |
95.67 s |
|
Q4 |
0.71 s |
345.11 s |
|
Q5 |
2.93 s |
112.29 s |
|
Q6 |
4.76 s |
14.41 s |
|
Q7 |
2.08 s |
329.25 s |
|
Q8 |
3.00 s |
98.91 s |
|
Q9 |
5.58 s |
250.88 s |
|
Q10 |
8.23 s |
55.23 s |
|
Q11 |
4.26 s |
27.84 s |
|
Q12 |
1.74 s |
37.66 s |
|
Q13 |
6.07 s |
147.69 s |
|
Q14 |
1.73 s |
23.91 s |
|
Q15 |
2.27 s |
23.79 s |
|
Q16 |
2.41 s |
34.76 s |
|
Q17 |
3.92 s |
362.43 s |
|
Q18 |
3.02 s |
348.08 s |
|
Q19 |
2.27 s |
443.94 s |
|
Q20 |
3.05 s |
92.50 s |
|
Q21 |
2.00 s |
623.69 s |
|
Q22 |
0.37 s |
61.36 s |
|
Total for Slowest Stream |
67 s |
3740 s |
|
There are 4 queries in Impala that terminated with an error (memory limit exceeded ). These were two Q21s, one Q19, one Q4. One stream executed without errors, so this stream is reported as the slowest stream. Q21 will, in the absence of indexed access, do a hash build side of half of lineitem , which explains running out of memory. Virtuoso does Q21 mostly by index.
Looking at the 5 streams, we see CPU between 1000% and 2000% on either box. This looks about 5x more than the 250% per box that we were seeing with, for instance, Q1. The process sizes for impalad are over 160G, certainly enough to have the working set in memory. iostat also does not show any I , so we seem to be running from memory, as intended.
We observe that Impala does not store tables in any specific order. Therefore a merge join of orders and lineitem is not possible. Thus we always get a hash join with a potentially large build side, e.g., half of orders and half of lineitem in Q21, and all orders in Q9. This explains in part why these take so long. TPC-DS does not pose this particular problem though, as there are no tables in the DS schema where the primary key of one would be the prefix of that of another.
However, the lineitem/orders join does not explain the scores on Q1, Q20, or Q19. A simple hash join of lineitem and part was about 90s, with a replicated part hash table. In the profile, the hash probe was 74s, which seems excessive. One would have to single-step through the hash probe to find out what actually happens. Maybe there are prohibitive numbers of collisions, which would throw off the results across the board. We would have to ask the Impala community about this.
Anyway, Impala experts out there are invited to set the record straight. We have attached the results and the output of the Impala profile statement for each query for the single stream run. impala_stream0.zip contains the evidence for the single-stream run; impala-stream1-5.zip holds the 5-stream run.
To be more Big Data-like, we should probably run with significantly larger data than memory; for example, 3T in 0.5T RAM. At EC2, we could do this with 2 I3.8 instances (6.4T SSD each). With Virtuoso, we'd be done in 8 hours or so, counting 2x for the I/O and 30x for the greater scale (the 100G experiment goes in 8 minutes or so, all included). With Impala, we could be running for weeks, so at the very least we'd like to do this with an Impala expert, to make sure things are done right and will not have to be retried. Some of the hash joins would have to be done in multiple passes and with partitioning.
In subsequent articles, we will look at other players in this space, and possibly some other benchmarks, like the TPC-DS subset that Actian uses to beat Impala.
Vectored Execution in Column/Row Stores
[
Virtuoso Data Space Bot
]
This article discusses the relationship between vectored execution and column- and row-wise data representations. Column stores are traditionally considered to be good for big scans but poor at indexed access. This is not necessarily so, though. We take TPC-H Q9 as a starting point, working with different row- and column-wise data representations and index choices. The goal of the article is to provide a primer on the performance implications of different physical designs.
All the experiments are against the TPC-H 100G dataset hosted in Virtuoso on the test system used before in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The Virtuoso version corresponds to the feature/analytics branch in the v7fasttrack github project. All run times are from memory, and queries generally run at full platform, 24 concurrent threads.
We note that RDF stores and graph databases usually do not have secondary indices with multiple key parts. However, these do predominantly index-based access as opposed to big scans and hash joins. To explore the impact of this, we have decomposed the tables into projections with a single dependent column, which approximates a triple store or a vertically-decomposed graph database like Sparksee.
So, in these experiments, we store the relevant data four times over, as follows:
-
100G TPC-H dataset in the column-wise schema as discussed in the TPC-H series, now complemented with indices on l_partkey and on l_partkey, l_suppkey
-
The same in row-wise data representation
-
Column-wise tables with a single dependent column for l_partkey, l_suppkey, l_extendedprice, l_quantity, l_discount, ps_supplycost, s_nationkey, p_name . These all have the original tables primary key, e.g., l_orderkey, l_linenumber for the l_ prefixed tables
-
The same with row-wise tables
The column-wise structures are in the DB qualifier, and the row-wise are in the R qualifier. There is a summary of space consumption at the end of the article. This is relevant for scalability, since even if row-wise structures can be faster for scattered random access, they will fit less data in RAM, typically 2 to 3x less. Thus, if "faster" rows cause the working set not to fit, "slower" columns will still win.
As a starting point, we know that the best Q9 is the one in the Virtuoso TPC-H implementation which is described in Part 10 of the TPC-H blog series. This is a scan of lineitem with a selective hash join followed ordered index access of orders , then hash joins against the smaller tables. There are special tricks to keep the hash tables small by propagating restrictions from the probe side to the build side.
The query texts are available here, along with the table declarations and scripts for populating the single-column projections. rs.sql makes the tables and indices, rsload.sql copies the data from the TPC-H tables.
The business question is to calculate the profit from sale of selected parts grouped by year and country of the supplier . This touches most of the tables, aggregates over 1/17 of all sales, and touches at least every page of the tables concerned, if not every row.
SELECT
n_name AS nation,
EXTRACT(year FROM o_orderdate) AS o_year,
SUM (l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) AS sum_profit
FROM lineitem, part, partsupp, orders, supplier, nation
WHERE s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_orderkey = l_orderkey
AND s_nationkey = n_nationkey
AND p_name LIKE '%green%'
GROUP BY nation, o_year
ORDER BY nation, o_year DESC
Query Variants
The query variants discussed here are:
-
Hash based, the best plan -- 9h.sql
-
Index based with multicolumn rows, with lineitem index on l_partkey -- 9i.sql, 9ir.sql
-
Index based with multicolumn rows, lineitem index on l_partkey, l_suppkey -- 9ip.sql, 9ipr.sql
-
Index based with one table per dependent column, index on l_partkey -- 9p.sql
-
index based with one table per dependent column, with materialized l_partkey, l_suppkey -> l_orderkey, l_minenumber -- 9pp.sql, 9ppr.sql
These are done against row- and column-wise data representations with 3 different vectorization settings. The dynamic vector size starts at 10,000 values in a vector, and adaptively upgrades this to 1,000,000 if it finds that index access is too sparse. Accessing rows close to each other is more efficient than widely scattered rows in vectored index access, so using a larger vector will likely cause a denser, hence more efficient, access pattern.
The 10K vector size corresponds to running with a fixed vector size. The Vector 1 sets vector size to 1, effectively running a tuple at a time, which corresponds to a non-vectorized engine.
We note that lineitem and its single column projections contain 600M rows. So, a vector of 10K values will hit, on the average, every 60,000th row. A vector of 1,000,000 will thus hit every 600th. This is when doing random lookups that are in no specific order, e.g., getting lineitems by a secondary index on l_partkey .
1 — Hash-based plan
Vector |
Dynamic |
10k |
1 |
Column-wise
|
4.1 s
|
4.1 s
|
145 s
|
Row-wise
|
25.6 s
|
25.9 s
|
45.4 s
|
Dynamic vector size has no effect here, as there is no indexed access that would gain from more locality. The column store is much faster because of less memory access (just scan the l_partkey column, and filter this with a Bloom filter; and then hash table lookup to pick only items with the desired part ). The other columns are accessed only for the matching rows. The hash lookup is vectored since there are hundreds of compressed l_partkey values available at each time. The row store does the hash lookup row by row, hence losing cache locality and instruction-level parallelism.
Without vectorization, we have a situation where the lineitem scan emits one row at a time. Restarting the scan with the column store takes much longer, since 5 buffers have to be located and pinned instead of one for the row store. The row store is thus slowed down less, but it too suffers almost a factor of 2 from interpretation overhead.
2 — Index-based, lineitem indexed on l_partkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
30.4 s
|
62.3 s
|
321 s
|
Row-wise
|
31.8 s
|
27.7 s
|
122 s
|
Here the plan scans part , then partsupp , which shares ordering with part ; both are ordered on partkey . Then lineitem is fetched by a secondary index on l_partkey . This produces l_orderkey, l_lineitem , which are used to get the l_suppkey . We then check if the l_suppkey matches the ps_suppkey from partsupp , which drops 3/4 of the rows. The next join is on orders , which shares ordering with lineitem ; both are ordered on orderkey .
There is a narrow win for columns with dynamic vector size. When access becomes scattered, rows win by 2.5x, because there is only one page to access instead of 1 + 3 for columns. This is compensated for if the next item is found on the same page, which happens if the access pattern is denser.
3 — Index-based, lineitem indexed on L_partkey, l_suppkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
16.9 s
|
47.2 s
|
151 s
|
Row-wise
|
22.4 s
|
20.7 s
|
89 s
|
This is similar to the previous, except that now only lineitems that match ps_partkey, ps_suppkey are accessed, as the secondary index has two columns. Access is more local. Columns thus win more with dynamic vector size.
4 — Decomposed, index on l_partkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
35.7 s
|
170 s
|
601 s
|
Row-wise
|
44.5 s
|
56.2 s
|
130 s
|
Now, each of the l_extendedprice, l_discount, l_quantity and l_suppkey is a separate index lookup. The times are slightly higher but the dynamic is the same.
The non-vectored columns case is hit the hardest.
5 — Decomposed, index on l_partkey, l_suppkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
19.6 s
|
111 s
|
257 s
|
Row-wise
|
32.0 s
|
37 s
|
74.9 s
|
Again, we see the same dynamic as with a multicolumn table. Columns win slightly more at long vector sizes because of overall better index performance in the presence of locality.
Space Utilization
The following tables list the space consumption in megabytes of allocated pages. Unallocated space in database files is not counted.
The row-wise table also contains entries for column-wise structures (DB.* ) since these have a row-wise sparse index. The size of this is however negligible, under 1% of the column-wise structures.
Row-Wise | |
Column-Wise |
MB |
structure |
73515
|
R.DBA.LINEITEM
|
14768
|
R.DBA.ORDERS
|
11728
|
R.DBA.PARTSUPP
|
10161 |
r_lpk_pk
|
10003 |
r_l_pksk
|
9908 |
R.DBA.l_partkey
|
8761 |
R.DBA.l_extendedprice
|
8745 |
R.DBA.l_discount
|
8738 |
r_l_pk
|
8713 |
R.DBA.l_suppkey
|
6267 |
R.DBA.l_quantity
|
2223 |
R.DBA.CUSTOMER
|
2180 |
R.DBA.o_orderdate
|
2041 |
r_O_CK
|
1911 |
R.DBA.PART
|
1281 |
R.DBA.ps_supplycost
|
811 |
R.DBA.p_name
|
127 |
R.DBA.SUPPLIER
|
88 |
DB.DBA.LINEITEM
|
24 |
DB.DBA.ORDERS
|
11 |
DB.DBA.PARTSUPP
|
9 |
R.DBA.s_nationkey
|
5 |
l_pksk
|
4 |
DB.DBA.l_partkey
|
4 |
lpk_pk
|
4 |
DB.DBA.l_extendedprice
|
3 |
l_pk
|
3 |
DB.DBA.l_suppkey
|
2 |
DB.DBA.CUSTOMER
|
2 |
DB.DBA.l_quantity
|
1 |
DB.DBA.PART
|
1 |
O_CK
|
1 |
DB.DBA.l_discount
|
|
|
MB |
structure |
36482
|
DB.DBA.LINEITEM
|
13087 |
DB.DBA.ORDERS
|
11587 |
DB.DBA.PARTSUPP
|
5181 |
DB.DBA.l_extendedprice
|
4431 |
l_pksk
|
3072 |
DB.DBA.l_partkey
|
2958 |
lpk_pk
|
2918 |
l_pk
|
2835 |
DB.DBA.l_suppkey
|
2067 |
DB.DBA.CUSTOMER
|
1618 |
DB.DBA.PART
|
1156 |
DB.DBA.l_quantity
|
961 |
DB.DBA.ps_supplycost
|
814 |
O_CK
|
798 |
DB.DBA.l_discount
|
724 |
DB.DBA.p_name
|
436 |
DB.DBA.o_orderdate
|
126 |
DB.DBA.SUPPLIER
|
1 |
DB.DBA.s_nationkey
|
|
In both cases, the large tables are on top, but the column-wise case takes only half the space due to compression.
We note that the single column projections are smaller column-wise. The l_extendedprice is not very compressible hence column-wise takes much more space than l_quantity ; the row-wise difference is less. Since the leading key parts l_orderkey, l_linenumber are ordered and very compressible, the column-wise structures are in all cases noticeably more compact.
The same applies to the multipart index l_pksk and r_l_pksk (l_partkey, l_suppkey, l_orderkey, l_linenumber ) in column- and row-wise representations.
Note that STRING columns (e.g., l_comment ) are not compressed. If they were, the overall space ratio would be even more to the advantage of the column store.
Conclusions
Column stores and vectorization inextricably belong together. Column-wise compression yields great gains also for indices, since sorted data is easy to compress. Also for non-sorted data, adaptive use of dictionaries, run lengths, etc., produce great space savings. Columns also win with indexed access if there is locality.
Row stores have less dependence on locality, but they also will win by a factor of 3 from dropping interpretation overhead and exploiting join locality.
For point lookups, columns lose by 2+x but considering their better space efficiency, they will still win if space savings prevent going to secondary storage. For bulk random access, like in graph analytics, columns will win because of being able to operate on a large vector of keys to fetch.
For many workloads, from TPC-H to LDBC social network, multi-part keys are a necessary component of physical design for performance if indexed access predominates. Triple stores and most graph databases do not have such and are therefore at a disadvantage. Self-joining, like in RDF or other vertically decomposed structures, can cost up to a factor of 10-20 over a column-wise multicolumn table. This depends however on the density of access.
For analytical workloads, where the dominant join pattern is the scan with selective hash join, column stores are unbeatable, as per common wisdom. There are good physical reasons for this and the row store even with well implemented vectorization loses by a factor of 5.
For decomposed structures, like RDF quads or single column projections of tables, column stores are relatively more advantageous because the key columns are extensively repeated, and these compress better with columns than with rows. In all the RDF workloads we have tried, columns never lose, but there is often a draw between rows and columns for lookup workloads. The longer the query, the more columns win.
07/13/2015 13:49 GMT
|
Modified:
07/13/2015 13:56 GMT
|
Vectored Execution in Column/Row Stores
[
Orri Erling
]
This article discusses the relationship between vectored execution and column- and row-wise data representations. Column stores are traditionally considered to be good for big scans but poor at indexed access. This is not necessarily so, though. We take TPC-H Q9 as a starting point, working with different row- and column-wise data representations and index choices. The goal of the article is to provide a primer on the performance implications of different physical designs.
All the experiments are against the TPC-H 100G dataset hosted in Virtuoso on the test system used before in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The Virtuoso version corresponds to the feature/analytics branch in the v7fasttrack github project. All run times are from memory, and queries generally run at full platform, 24 concurrent threads.
We note that RDF stores and graph databases usually do not have secondary indices with multiple key parts. However, these do predominantly index-based access as opposed to big scans and hash joins. To explore the impact of this, we have decomposed the tables into projections with a single dependent column, which approximates a triple store or a vertically-decomposed graph database like Sparksee.
So, in these experiments, we store the relevant data four times over, as follows:
-
100G TPC-H dataset in the column-wise schema as discussed in the TPC-H series, now complemented with indices on l_partkey and on l_partkey, l_suppkey
-
The same in row-wise data representation
-
Column-wise tables with a single dependent column for l_partkey, l_suppkey, l_extendedprice, l_quantity, l_discount, ps_supplycost, s_nationkey, p_name . These all have the original tables primary key, e.g., l_orderkey, l_linenumber for the l_ prefixed tables
-
The same with row-wise tables
The column-wise structures are in the DB qualifier, and the row-wise are in the R qualifier. There is a summary of space consumption at the end of the article. This is relevant for scalability, since even if row-wise structures can be faster for scattered random access, they will fit less data in RAM, typically 2 to 3x less. Thus, if "faster" rows cause the working set not to fit, "slower" columns will still win.
As a starting point, we know that the best Q9 is the one in the Virtuoso TPC-H implementation which is described in Part 10 of the TPC-H blog series. This is a scan of lineitem with a selective hash join followed ordered index access of orders , then hash joins against the smaller tables. There are special tricks to keep the hash tables small by propagating restrictions from the probe side to the build side.
The query texts are available here, along with the table declarations and scripts for populating the single-column projections. rs.sql makes the tables and indices, rsload.sql copies the data from the TPC-H tables.
The business question is to calculate the profit from sale of selected parts grouped by year and country of the supplier . This touches most of the tables, aggregates over 1/17 of all sales, and touches at least every page of the tables concerned, if not every row.
SELECT
n_name AS nation,
EXTRACT(year FROM o_orderdate) AS o_year,
SUM (l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) AS sum_profit
FROM lineitem, part, partsupp, orders, supplier, nation
WHERE s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_orderkey = l_orderkey
AND s_nationkey = n_nationkey
AND p_name LIKE '%green%'
GROUP BY nation, o_year
ORDER BY nation, o_year DESC
Query Variants
The query variants discussed here are:
-
Hash based, the best plan -- 9h.sql
-
Index based with multicolumn rows, with lineitem index on l_partkey -- 9i.sql, 9ir.sql
-
Index based with multicolumn rows, lineitem index on l_partkey, l_suppkey -- 9ip.sql, 9ipr.sql
-
Index based with one table per dependent column, index on l_partkey -- 9p.sql
-
index based with one table per dependent column, with materialized l_partkey, l_suppkey -> l_orderkey, l_minenumber -- 9pp.sql, 9ppr.sql
These are done against row- and column-wise data representations with 3 different vectorization settings. The dynamic vector size starts at 10,000 values in a vector, and adaptively upgrades this to 1,000,000 if it finds that index access is too sparse. Accessing rows close to each other is more efficient than widely scattered rows in vectored index access, so using a larger vector will likely cause a denser, hence more efficient, access pattern.
The 10K vector size corresponds to running with a fixed vector size. The Vector 1 sets vector size to 1, effectively running a tuple at a time, which corresponds to a non-vectorized engine.
We note that lineitem and its single column projections contain 600M rows. So, a vector of 10K values will hit, on the average, every 60,000th row. A vector of 1,000,000 will thus hit every 600th. This is when doing random lookups that are in no specific order, e.g., getting lineitems by a secondary index on l_partkey .
1 — Hash-based plan
Vector |
Dynamic |
10k |
1 |
Column-wise
|
4.1 s
|
4.1 s
|
145 s
|
Row-wise
|
25.6 s
|
25.9 s
|
45.4 s
|
Dynamic vector size has no effect here, as there is no indexed access that would gain from more locality. The column store is much faster because of less memory access (just scan the l_partkey column, and filter this with a Bloom filter; and then hash table lookup to pick only items with the desired part ). The other columns are accessed only for the matching rows. The hash lookup is vectored since there are hundreds of compressed l_partkey values available at each time. The row store does the hash lookup row by row, hence losing cache locality and instruction-level parallelism.
Without vectorization, we have a situation where the lineitem scan emits one row at a time. Restarting the scan with the column store takes much longer, since 5 buffers have to be located and pinned instead of one for the row store. The row store is thus slowed down less, but it too suffers almost a factor of 2 from interpretation overhead.
2 — Index-based, lineitem indexed on l_partkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
30.4 s
|
62.3 s
|
321 s
|
Row-wise
|
31.8 s
|
27.7 s
|
122 s
|
Here the plan scans part , then partsupp , which shares ordering with part ; both are ordered on partkey . Then lineitem is fetched by a secondary index on l_partkey . This produces l_orderkey, l_lineitem , which are used to get the l_suppkey . We then check if the l_suppkey matches the ps_suppkey from partsupp , which drops 3/4 of the rows. The next join is on orders , which shares ordering with lineitem ; both are ordered on orderkey .
There is a narrow win for columns with dynamic vector size. When access becomes scattered, rows win by 2.5x, because there is only one page to access instead of 1 + 3 for columns. This is compensated for if the next item is found on the same page, which happens if the access pattern is denser.
3 — Index-based, lineitem indexed on L_partkey, l_suppkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
16.9 s
|
47.2 s
|
151 s
|
Row-wise
|
22.4 s
|
20.7 s
|
89 s
|
This is similar to the previous, except that now only lineitems that match ps_partkey, ps_suppkey are accessed, as the secondary index has two columns. Access is more local. Columns thus win more with dynamic vector size.
4 — Decomposed, index on l_partkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
35.7 s
|
170 s
|
601 s
|
Row-wise
|
44.5 s
|
56.2 s
|
130 s
|
Now, each of the l_extendedprice, l_discount, l_quantity and l_suppkey is a separate index lookup. The times are slightly higher but the dynamic is the same.
The non-vectored columns case is hit the hardest.
5 — Decomposed, index on l_partkey, l_suppkey
Vector |
Dynamic |
10k |
1 |
Column-wise
|
19.6 s
|
111 s
|
257 s
|
Row-wise
|
32.0 s
|
37 s
|
74.9 s
|
Again, we see the same dynamic as with a multicolumn table. Columns win slightly more at long vector sizes because of overall better index performance in the presence of locality.
Space Utilization
The following tables list the space consumption in megabytes of allocated pages. Unallocated space in database files is not counted.
The row-wise table also contains entries for column-wise structures (DB.* ) since these have a row-wise sparse index. The size of this is however negligible, under 1% of the column-wise structures.
Row-Wise | |
Column-Wise |
MB |
structure |
73515
|
R.DBA.LINEITEM
|
14768
|
R.DBA.ORDERS
|
11728
|
R.DBA.PARTSUPP
|
10161 |
r_lpk_pk
|
10003 |
r_l_pksk
|
9908 |
R.DBA.l_partkey
|
8761 |
R.DBA.l_extendedprice
|
8745 |
R.DBA.l_discount
|
8738 |
r_l_pk
|
8713 |
R.DBA.l_suppkey
|
6267 |
R.DBA.l_quantity
|
2223 |
R.DBA.CUSTOMER
|
2180 |
R.DBA.o_orderdate
|
2041 |
r_O_CK
|
1911 |
R.DBA.PART
|
1281 |
R.DBA.ps_supplycost
|
811 |
R.DBA.p_name
|
127 |
R.DBA.SUPPLIER
|
88 |
DB.DBA.LINEITEM
|
24 |
DB.DBA.ORDERS
|
11 |
DB.DBA.PARTSUPP
|
9 |
R.DBA.s_nationkey
|
5 |
l_pksk
|
4 |
DB.DBA.l_partkey
|
4 |
lpk_pk
|
4 |
DB.DBA.l_extendedprice
|
3 |
l_pk
|
3 |
DB.DBA.l_suppkey
|
2 |
DB.DBA.CUSTOMER
|
2 |
DB.DBA.l_quantity
|
1 |
DB.DBA.PART
|
1 |
O_CK
|
1 |
DB.DBA.l_discount
|
|
|
MB |
structure |
36482
|
DB.DBA.LINEITEM
|
13087 |
DB.DBA.ORDERS
|
11587 |
DB.DBA.PARTSUPP
|
5181 |
DB.DBA.l_extendedprice
|
4431 |
l_pksk
|
3072 |
DB.DBA.l_partkey
|
2958 |
lpk_pk
|
2918 |
l_pk
|
2835 |
DB.DBA.l_suppkey
|
2067 |
DB.DBA.CUSTOMER
|
1618 |
DB.DBA.PART
|
1156 |
DB.DBA.l_quantity
|
961 |
DB.DBA.ps_supplycost
|
814 |
O_CK
|
798 |
DB.DBA.l_discount
|
724 |
DB.DBA.p_name
|
436 |
DB.DBA.o_orderdate
|
126 |
DB.DBA.SUPPLIER
|
1 |
DB.DBA.s_nationkey
|
|
In both cases, the large tables are on top, but the column-wise case takes only half the space due to compression.
We note that the single column projections are smaller column-wise. The l_extendedprice is not very compressible hence column-wise takes much more space than l_quantity ; the row-wise difference is less. Since the leading key parts l_orderkey, l_linenumber are ordered and very compressible, the column-wise structures are in all cases noticeably more compact.
The same applies to the multipart index l_pksk and r_l_pksk (l_partkey, l_suppkey, l_orderkey, l_linenumber ) in column- and row-wise representations.
Note that STRING columns (e.g., l_comment ) are not compressed. If they were, the overall space ratio would be even more to the advantage of the column store.
Conclusions
Column stores and vectorization inextricably belong together. Column-wise compression yields great gains also for indices, since sorted data is easy to compress. Also for non-sorted data, adaptive use of dictionaries, run lengths, etc., produce great space savings. Columns also win with indexed access if there is locality.
Row stores have less dependence on locality, but they also will win by a factor of 3 from dropping interpretation overhead and exploiting join locality.
For point lookups, columns lose by 2+x but considering their better space efficiency, they will still win if space savings prevent going to secondary storage. For bulk random access, like in graph analytics, columns will win because of being able to operate on a large vector of keys to fetch.
For many workloads, from TPC-H to LDBC social network, multi-part keys are a necessary component of physical design for performance if indexed access predominates. Triple stores and most graph databases do not have such and are therefore at a disadvantage. Self-joining, like in RDF or other vertically decomposed structures, can cost up to a factor of 10-20 over a column-wise multicolumn table. This depends however on the density of access.
For analytical workloads, where the dominant join pattern is the scan with selective hash join, column stores are unbeatable, as per common wisdom. There are good physical reasons for this and the row store even with well implemented vectorization loses by a factor of 5.
For decomposed structures, like RDF quads or single column projections of tables, column stores are relatively more advantageous because the key columns are extensively repeated, and these compress better with columns than with rows. In all the RDF workloads we have tried, columns never lose, but there is often a draw between rows and columns for lookup workloads. The longer the query, the more columns win.
07/13/2015 13:46 GMT
|
Modified:
07/13/2015 13:56 GMT
|
Virtuoso at SIGMOD 2015
[
Orri Erling
]
Two papers presented at SIGMOD 2015 have been added to the Virtuoso Science Library.
-
Orri Erling (OpenLink Software); Alex Averbuch (Neo Technology); Josep Larriba-Pey (Sparsity Technologies); Hassan Chafi (Oracle Labs); Andrey Gubichev (TU Munich); Arnau Prat-Pérez (Universitat Politècnica de Catalunya); Minh-Duc Pham (VU University Amsterdam); Peter Boncz (CWI): The LDBC Social Network Benchmark: Interactive Workload. Proceedings of SIGMOD 2015, Melbourne.
This paper is an overview of the challenges posed in the LDBC social network benchmark, from data generation to the interactive workload.
-
Mihai Capotă (Delft University of Technology), Tim Hegeman (Delft University of Technology), Alexandru Iosup (Delft University of Technology), Arnau Prat-Pérez (Universitat Politècnica de Catalunya), Orri Erling (OpenLink Software), Peter Boncz (CWI): Graphalytics: A Big Data Benchmark for Graph-Processing Platforms. Sigmod GRADES 2015.
This paper discusses the future evolution of the LDBC Social Network Benchmark and gives a preview of Virtuoso graph traversal performance.
|
|