Benchmarks, Redux (part 14): BSBM BI Mix
[ Orri Erling ]
In this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post.
Our intent here is to see whether the metric works, and what results will look like in general. We are testing the benchmark as much as we are testing the system under test (SUT). The results shown here will likely not be comparable with future ones, because the composition of the workload seems a bit out of balance and will most likely change. Still, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so those interested can still try a comparison with their own systems. If you practice with this workload for the coming races, the effort will surely not be wasted.
Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.
Below we show samples of test driver output; the whole output is downloadable.
100 Mt Single User
bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill \
-ucf bsbm/usecases/businessIntelligence/sparql.txt \
-dg http://bsbm.org http://localhost:8604/sparql
0: 43348.14ms, total: 43440ms
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Seed: 808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime: 43.3481s / 43.3481s
Elapsed runtime: 43.348 seconds
QMpH: 83.049 query mixes per hour
CQET: 43.348 seconds average runtime of query mix
CQET (geom.): 43.348 seconds geometric mean runtime of query mix
AQET (geom.): 0.492 seconds geometric mean runtime of query
Throughput: 1494.874 BSBM-BI throughput: qph*scale
BI Power: 7309.820 BSBM-BI Power: qph*scale (geom)
100 Mt 8 User
Thread 6: query mix 3: 195793.09ms, total: 196086.18ms
Thread 8: query mix 0: 197843.84ms, total: 198010.50ms
Thread 7: query mix 4: 201806.28ms, total: 201996.26ms
Thread 2: query mix 5: 221983.93ms, total: 222105.96ms
Thread 4: query mix 7: 225127.55ms, total: 225317.49ms
Thread 3: query mix 6: 225860.49ms, total: 226050.17ms
Thread 5: query mix 2: 230884.93ms, total: 231067.61ms
Thread 1: query mix 1: 237836.61ms, total: 237959.11ms
Benchmark run completed in 237.985427s
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Number of clients: 8
Seed: 808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime: 195.7931s / 237.8366s
Total runtime (sum): 1737.137 seconds
Elapsed runtime: 1737.137 seconds
QMpH: 121.016 query mixes per hour
CQET: 217.142 seconds average runtime of query mix
CQET (geom.): 216.603 seconds geometric mean runtime of query mix
AQET (geom.): 2.156 seconds geometric mean runtime of query
Throughput: 2178.285 BSBM-BI throughput: qph*scale
BI Power: 1669.745 BSBM-BI Power: qph*scale (geom)
1000 Mt Single User
0: 608707.03ms, total: 608768ms
Scale factor: 2848260
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Seed: 808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime: 608.7070s / 608.7070s
Elapsed runtime: 608.707 seconds
QMpH: 5.914 query mixes per hour
CQET: 608.707 seconds average runtime of query mix
CQET (geom.): 608.707 seconds geometric mean runtime of query mix
AQET (geom.): 5.167 seconds geometric mean runtime of query
Throughput: 1064.552 BSBM-BI throughput: qph*scale
BI Power: 6967.325 BSBM-BI Power: qph*scale (geom)
1000 Mt 8 User
bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill \
-ucf bsbm/usecases/businessIntelligence/sparql.txt \
-dg http://bsbm.org http://localhost:8604/sparql
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s
Scale factor: 2848260
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Number of clients: 8
Seed: 808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime: 2211.2753s / 2889.1992s
Total runtime (sum): 20481.895 seconds
Elapsed runtime: 20481.895 seconds
QMpH: 9.968 query mixes per hour
CQET: 2560.237 seconds average runtime of query mix
CQET (geom.): 2544.284 seconds geometric mean runtime of query mix
AQET (geom.): 13.556 seconds geometric mean runtime of query
Throughput: 1794.205 BSBM-BI throughput: qph*scale
BI Power: 2655.678 BSBM-BI Power: qph*scale (geom)
Metrics for Query: 1
Count: 8 times executed in whole run
Time share 2.120884% of total execution time
AQET: 54.299656 seconds (arithmetic mean)
AQET(geom.): 34.607302 seconds (geometric mean)
QPS: 0.13 Queries per second
minQET/maxQET: 11.71547600s / 148.65379700s
Metrics for Query: 2
Count: 8 times executed in whole run
Time share 0.207382% of total execution time
AQET: 5.309462 seconds (arithmetic mean)
AQET(geom.): 2.737696 seconds (geometric mean)
QPS: 1.34 Queries per second
minQET/maxQET: 0.78729800s / 25.80948200s
Metrics for Query: 3
Count: 8 times executed in whole run
Time share 17.650472% of total execution time
AQET: 451.893890 seconds (arithmetic mean)
AQET(geom.): 410.481088 seconds (geometric mean)
QPS: 0.02 Queries per second
minQET/maxQET: 171.07262500s / 721.72939200s
Metrics for Query: 5
Count: 32 times executed in whole run
Time share 6.196565% of total execution time
AQET: 39.661685 seconds (arithmetic mean)
AQET(geom.): 6.849882 seconds (geometric mean)
QPS: 0.18 Queries per second
minQET/maxQET: 0.15696500s / 189.00906200s
Metrics for Query: 6
Count: 8 times executed in whole run
Time share 0.119916% of total execution time
AQET: 3.070136 seconds (arithmetic mean)
AQET(geom.): 2.056059 seconds (geometric mean)
QPS: 2.31 Queries per second
minQET/maxQET: 0.41524400s / 7.55655300s
Metrics for Query: 7
Count: 40 times executed in whole run
Time share 1.577963% of total execution time
AQET: 8.079921 seconds (arithmetic mean)
AQET(geom.): 1.342079 seconds (geometric mean)
QPS: 0.88 Queries per second
minQET/maxQET: 0.02205800s / 40.27761500s
Metrics for Query: 8
Count: 40 times executed in whole run
Time share 72.126818% of total execution time
AQET: 369.323481 seconds (arithmetic mean)
AQET(geom.): 114.431863 seconds (geometric mean)
QPS: 0.02 Queries per second
minQET/maxQET: 5.94377300s / 1824.57867400s
The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.
With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also, core multi-threading seems to bring some benefit. If the single-user run were at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, though we must remember that not every query mix is exactly the same length, so the figure is not exact. At the very least, core multi-threading does not seem to hurt. Comparing the same numbers with the column store will be interesting, since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good.
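As a quick check of that figure against the reported numbers: the 1000 Mt single-user run delivers a Throughput of 1064.6 at 574% CPU, so scaling linearly to 800% gives

$$ 1064.6 \times \frac{800}{574} \approx 1484, $$

which is the quoted ~1488 up to rounding.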
For reference, we show a run with Virtuoso 6 at 100 Mt.
0: 424754.40ms, total: 424829ms
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Seed: 808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime: 424.7544s / 424.7544s
Elapsed runtime: 424.754 seconds
QMpH: 8.475 query mixes per hour
CQET: 424.754 seconds average runtime of query mix
CQET (geom.): 424.754 seconds geometric mean runtime of query mix
AQET (geom.): 1.097 seconds geometric mean runtime of query
Throughput: 152.559 BSBM-BI throughput: qph*scale
BI Power: 3281.150 BSBM-BI Power: qph*scale (geom)
and with 8 users --
Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s
Scale factor: 284826
Explore Endpoints: 1
Update Endpoints: 1
Drilldown: on
Number of warmup runs: 0
Number of clients: 8
Seed: 808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime: 616.9979s / 701.8247s
Total runtime (sum): 5237.127 seconds
Elapsed runtime: 5237.127 seconds
QMpH: 41.031 query mixes per hour
CQET: 654.641 seconds average runtime of query mix
CQET (geom.): 653.873 seconds geometric mean runtime of query mix
AQET (geom.): 2.557 seconds geometric mean runtime of query
Throughput: 738.557 BSBM-BI throughput: qph*scale
BI Power: 1408.133 BSBM-BI Power: qph*scale (geom)
Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput.
100 Mt: sqrt (7771 * 2178) = 4114
1000 Mt: sqrt (6967 * 1794) = 3535
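For the record, the components as printed by the driver can be written out as formulas. The normalization constant below (284826, the 100 Mt scale factor) is inferred from the reported numbers, not quoted from a spec:

$$
\mathrm{Throughput} = \mathrm{QpH} \cdot \frac{\mathrm{scale}}{284826},
\qquad
\mathrm{Power} = \frac{3600}{\mathrm{AQET_{geom}}} \cdot \frac{\mathrm{scale}}{284826},
\qquad
\mathrm{Metric} = \sqrt{\mathrm{Power_{1user}} \cdot \mathrm{Throughput_{multiuser}}}
$$

where QpH counts individual queries, not mixes, per hour. As a check, the 1000 Mt single-user output gives $3600 / 5.167 \times 10 \approx 6967$, matching the reported BI Power.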
Scaling seems to work; the results are in the same general ballpark. The real times for the 1000 Mt run are a bit over 10x the times for the 100 Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy; if we made it start one level from the top, its share would drop. This seems reasonable.
Conversely, Q2 is out of place, with far too small a share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, for example measuring how many of the products in the category have the top 20 features found in that category.
Also there should be more queries.
At present, BSBM-BI appears definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. Recall that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at a benchmark with long-running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people.
Benchmark queries are trick questions: for example, TPC-H Q18 cannot be done without changing an IN into a JOIN with the IN subquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN, which, besides, must be partitioned at the larger scales.
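To make the Q18 trick concrete, here is the shape of the transformation, abridged from the official TPC-H query text (the column list is shortened):

```sql
-- Naive form: the IN subquery acts as a filter on the outer join.
SELECT c_name, o_orderkey, o_totalprice, SUM(l_quantity)
  FROM customer, orders, lineitem
 WHERE o_orderkey IN (SELECT l_orderkey
                        FROM lineitem
                       GROUP BY l_orderkey
                      HAVING SUM(l_quantity) > 300)
   AND c_custkey = o_custkey
   AND o_orderkey = l_orderkey
 GROUP BY c_name, o_orderkey, o_totalprice;

-- Rewritten form: the subquery becomes the outermost join input.  The one
-- big GROUP BY over lineitem can stream, because if lineitem is read in
-- l_orderkey order, each group is complete as soon as the key changes;
-- its rare survivors then drive cheap index lookups.  (The official text
-- re-sums l_quantity in the outer query; since the inner sum is over the
-- same rows, it can be reused.)
SELECT c_name, o_orderkey, o_totalprice, big.qty
  FROM (SELECT l_orderkey, SUM(l_quantity) AS qty
          FROM lineitem
         GROUP BY l_orderkey
        HAVING SUM(l_quantity) > 300) big,
       orders, customer
 WHERE o_orderkey = big.l_orderkey
   AND c_custkey = o_custkey;
```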
Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.
In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.
BSBM-BI does not have any clear trick questions, at least not premeditated ones. It just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater, and the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did, however, have to add some cardinality statistics to get reasonable JOIN orders, since we always reorder the query regardless of the source formulation.
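To spell out the requirement: for a candidate join order with per-step fanouts $f_1, \dots, f_n$, a generic index-based cost model (a sketch; not Virtuoso's actual formula) estimates

$$
c_k \;\approx\; c_0 \prod_{i=1}^{k} f_i,
\qquad
\mathrm{cost} \;\approx\; \sum_{k=1}^{n} c_{k-1}\, u_k,
$$

where $c_k$ is the intermediate cardinality after $k$ join steps and $u_k$ is the unit cost of the $k$-th lookup. To rank join orders, the $f_i$ guesses mostly need to land on the right side of 1; precision beyond that helps little, which is the "smaller from greater" point above.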
BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though.
There are places in BSBM-BI where using a HASH JOIN makes sense. We do not normally use HASH JOINs with RDF, because there is an index for everything, and a HASH JOIN in the wrong place can have a large up-front cost; one is more robust against cost-model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like "count the hairballs made in Germany reviewed by Japanese Hello Kitty fans," where the two ends of a JOIN path are fairly selective, doing one end as a HASH JOIN probed from the other is good. If the JOIN is always cardinality-reducing, the hash probe can even be merged inside an INDEX lookup. We have such capabilities, since we have for a while been gearing up for the relational races, but we are not using any of them with BSBM-BI, although they would be useful.
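As a sketch of that "hairball" shape in a hypothetical relational rendering (all table and column names invented for illustration):

```sql
-- Both filters are selective, so a good plan builds a hash table from
-- the small set of matching reviewers and probes it while scanning the
-- reviews of matching products, instead of doing a per-row index lookup
-- for each review.
SELECT COUNT(*)
  FROM product  p
  JOIN review   r ON r.product_id = p.id
  JOIN reviewer v ON v.id = r.reviewer_id   -- hash build side
 WHERE p.product_type = 'hairball'
   AND p.country      = 'DE'                -- selective end 1
   AND v.country      = 'JP'                -- selective end 2
   AND v.fan_of       = 'Hello Kitty';
```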
Let us see the profile for a single user 100 Mt run.
The database activity summary is --
select db_activity (0, 'http');
161.3M rnd 210.2M seq 0 same seg 104.5M same pg 45.08M same par 0 disk 0 spec disk 0B / 0 messages 2.393K fork
See the post "What Does BSBM Explore Measure" for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --
172.4M rnd 220.8M seq 0 same seg 149.6M same pg 10.99M same par 21 disk 861 spec disk 0B / 0 messages 754 fork
The Throughput goes from 1494 to 1779, and we see more hits on the same page, as expected. We do not make this setting the default, since it raises the cost of small queries; the vector size must therefore be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. Once the self-adjust logic is tuned correctly, we will have these clear gains without manual settings.
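For anyone reproducing this: in Virtuoso 7, the vector size is controlled from the [Parameters] section of virtuoso.ini. The parameter names below are an assumption on my part, taken from later released builds; the feature was still settling at the time of this run:

```ini
[Parameters]
;; Assumed parameter names (later Virtuoso 7 builds).
;; The run above fixes the vector size by hand and leaves the
;; on-the-fly adjustment off:
VectorSize       = 100000   ; default is 10000
AdjustVectorSize = 0        ; 1 = let the engine grow the vector when locality is poor
```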
Let us now go back to the first run with vector size 10000.
The top of the CPU oprofile is as follows:
722309 15.4507 cmpf_iri64n_iri64n
434791 9.3005 cmpf_iri64n_iri64n_anyn_iri64n
294712 6.3041 itc_next_set
273488 5.8501 itc_vec_split_search
203970 4.3631 itc_dive_transit
199687 4.2714 itc_page_rcf_search
181614 3.8848 dc_itc_append_any
173043 3.7015 itc_bm_vec_row_check
146727 3.1386 cmpf_int64n
128224 2.7428 itc_vec_row_check
113515 2.4282 dk_alloc
97296 2.0812 page_wait_access
62523 1.3374 qst_vec_get_int64
59014 1.2623 itc_next_set_parent
53589 1.1463 sslr_qst_get
48003 1.0268 ds_add
46641 0.9977 dk_free_tree
44551 0.9530 kc_var_col
43650 0.9337 page_col_cmp_1
35297 0.7550 cmpf_iri64n_iri64n_anyn_gt_lt
34589 0.7399 dv_compare
25864 0.5532 cmpf_iri64n_anyn_iri64n_iri64n_lte
23088 0.4939 dk_free
The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead; this corresponds to a lookup with P and S given. The next one has all key parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent; we should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably also put in a query that makes a very large number of groups and could use streaming aggregation, i.e., take advantage of a situation where the aggregation input arrives already ordered on the grouping columns.
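For reference, the kind of query meant here is TPC-H Q1; abridged (a few aggregates dropped), it is one scan feeding arithmetic into a GROUP BY with only four groups:

```sql
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity)                                       AS sum_qty,
       SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
       SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       AVG(l_quantity)                                       AS avg_qty,
       COUNT(*)                                              AS count_order
  FROM lineitem
 WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
 GROUP BY l_returnflag, l_linestatus
 ORDER BY l_returnflag, l_linestatus;
```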
A BI use case should have no problem including arithmetic, but there are not that many numbers in the BSBM data set. Some sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good; TPC-H has such in Q21 and Q19. An OR over existence tests, where good guesses of a subquery's selectivity pay off, would be appropriate. Also, there should be conditional expressions somewhere over a lot of data, like the CASE-WHEN in TPC-H Q12.
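The CASE-WHEN referred to is the pair of conditional sums in TPC-H Q12, abridged here (some predicates dropped); the point is that the conditions are evaluated per joined row, over a lot of data:

```sql
SELECT l_shipmode,
       SUM(CASE WHEN o_orderpriority IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END) AS high_line_count,
       SUM(CASE WHEN o_orderpriority NOT IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END) AS low_line_count
  FROM orders, lineitem
 WHERE o_orderkey = l_orderkey
   AND l_shipmode IN ('MAIL', 'SHIP')
   AND l_receiptdate >= DATE '1994-01-01'
   AND l_receiptdate <  DATE '1995-01-01'
 GROUP BY l_shipmode
 ORDER BY l_shipmode;
```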
We can make BSBM-BI more interesting by putting in the above. We will also have to see where we can profit from HASH JOINs, both small and large. There should be such places in the workload already, so this is a matter of just playing a bit more.
This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.
Benchmarks, Redux Series
- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM-BI Modifications
- Benchmarks, Redux (part 14): BSBM-BI Mix (this post)
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
03/22/2011 18:31 GMT (modified: 03/22/2011 17:04 GMT)
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
[ Orri Erling ]
In the context of database benchmarks, we cannot ignore I/O, even though BSBM has pretty much done so thus far.
There are two approaches:
- run twice, or otherwise make sure one runs from memory, and forget about I/O; or
- make rules and metrics for warm-up.
We will see if the second is possible with BSBM.
From this starting point, we look at various ways of scheduling I/O in Virtuoso, using a 1000 Mt BSBM database on a set each of HDDs (hard disk drives) and SSDs (solid-state drives). We will see that SSDs can make a significant difference in this specific application.
In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.
Storage Arrays

| Type | Quantity | Maker   | Size    | Speed    | Interface speed | Controller                          | Drive Cache | RAID |
|------|----------|---------|---------|----------|-----------------|-------------------------------------|-------------|------|
| SSD  | 4        | Crucial | 128 GB  | N/A      | 6Gbit SATA      | RocketRaid 640                      | 128 MB      | None |
| HDD  | 4        | Samsung | 1000 GB | 7200 RPM | 3Gbit SATA      | Intel ICH on Supermicro motherboard | 16 MB       | None |
We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file > /dev/null`.
The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.
Two different read-ahead schemes are used:
- With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.
- With 7 Single, as index search is vectored, we know a large number of values to fetch at one time, and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed, and schedule these for read-ahead, skipping any that will not be accessed.
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.
There are a few different possibilities for the physical I/O:
- Using a separate read system call for each page: a thread that finds it needs a page simply reads it. There may be several open file descriptors on a file, so many such calls can proceed concurrently on different threads; the OS will order the operations.
- Using Unix asynchronous I/O (aio.h), with the aio_* and lio_listio functions.
- Using a single read system call for a run of adjacent pages. This way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.
The latter two apply only to bulk I/O scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.
There is a further choice for single-page random requests: they can either go to the elevator or be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we need not talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.
While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.
This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.
The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20-second stretch with CPU at 1200% or higher.
The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.
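The procedure itself is linked rather than quoted; a minimal sketch of the idea, sampling the cumulative db_activity() summary (shown elsewhere in this series) at a fixed interval, could look like the following. msec_time() and delay() are standard Virtuoso PL built-ins; the log table and all other names here are illustrative, not the actual meter:

```sql
-- Illustrative only: periodically snapshot the cumulative activity
-- counters so that warm-up can be plotted over time.
create table io_meter_log (ts integer, activity varchar);

create procedure io_meter (in n_samples integer, in interval_sec integer)
{
  declare i integer;
  i := 0;
  while (i < n_samples)
    {
      insert into io_meter_log (ts, activity)
        values (msec_time (), db_activity (0, 'http'));
      delay (interval_sec);
      i := i + 1;
    }
}
;
```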
The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or else from a specific second extent designated as the overflow extent of the first. This scheme provides a sort of pseudo-locality within extents even under random insert order; thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At the least, the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. To improve locality in general, one could shuffle the pages of an all-dirty subtree before writing it out, so that physical order matches key order. We will look at some tricks in this vein with the column store.
For the sake of simplicity we only run 7 Single with the 1000 Mt scale.
The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.
The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70%, after processing around 1000 query mixes. It might have taken hours more before HDD reads became rare enough to hit 1200% CPU; the test was not worth continuing.
The result with HDDs and vectored read-ahead would be worse still, since vectored read-ahead leads to smaller read-ahead batches and less contiguous read patterns; the individual read times there are over twice the individual read times with per-extent read-ahead. That vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.
There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer takes about 40 ms at a sequential transfer speed of 50 MB/s. Seeking to the next 2MB extent then adds a few ms, most often under 20, so the HDD should give at least half its nominal throughput.
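The arithmetic behind that estimate, using the figures above:

$$
t_{\mathrm{transfer}} = \frac{2\ \mathrm{MB}}{50\ \mathrm{MB/s}} = 40\ \mathrm{ms},
\qquad
t_{\mathrm{seek}} \le 20\ \mathrm{ms}
\quad\Rightarrow\quad
\frac{2\ \mathrm{MB}}{40 + 20\ \mathrm{ms}} \approx 33\ \mathrm{MB/s} \;>\; \tfrac{1}{2} \times 50\ \mathrm{MB/s}.
$$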
We note that, when reading sequential 8K pages inside a single 2MB (256-page) extent, the seek latency is not 0 as one would expect, but a rather extreme 5 ms. One would think the drive would buffer a whole track, and that a track would hold a large number of 2MB sections, but apparently this is not so.
Therefore, if a sequential read pattern is now denser than 1 page in 10, we read all the pages and just keep the ones we want.
So we set the read-ahead to merge reads that fall within 10 pages of each other. This wastes bandwidth but supposedly saves on latency; we will see.
We try this and find that read-ahead still does not account for most pages, because it does not get triggered often enough. Thus, we loosen the triggering condition to be a second read falling in the same extent within 20 seconds of the first.
The HDDs were in all cases around 700% busy for the 4 drives. With the new setting we get longer requests, most often full extents, which gives a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read as part of a read-ahead batch. Still, the I/O throughput decreases during the run, because more single-page reads fail to trigger extent read-ahead: the HDDs keep about 1.7 concurrent operations pending each, but the batch size drops, and the throughput drops with it.
Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.
BSBM Note
We look at query times with semi-warm cache, with CPU around 400%. Q8-Q12 are especially bad; Q5 runs at about half speed, and Q12 at under 1/10th speed. Relatively speaking, the slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there; neither query nor workload has any exploitable pattern. Therefore, if an I/O component is to be included in a BSBM metric, the only way to score on it is to use speculative read to the maximum.
Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.
A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Speculatively reading large chunks at startup is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would also favor space efficiency, which is generally a good thing, especially at larger scales; so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy: just speculatively read everything.
Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.
Benchmarks, Redux Series
- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs (this post)
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
03/07/2011 14:17 GMT (modified: 03/14/2011 17:16 GMT)
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
[
Orri Erling
]
In the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by BSBM.
There are two approaches:
-
run twice or otherwise make sure one runs from memory and forget about I/O, or
-
make rules and metrics for warm-up.
We will see if the second is possible with BSBM.
From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference.
In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.
| Storage Arrays |
| Type |
Quantity |
Maker |
Size |
Speed |
Interface speed |
Controller |
Drive Cache |
RAID |
| SSD |
4 |
Crucial |
128 GB |
N/A |
6Gbit SATA |
RocketRaid 640 |
128 MB |
None |
| HDD |
4 |
Samsung |
1000 GB |
7200 RPM |
3Gbit SATA |
Intel ICH on Supermicro motherboard |
16 MB |
None |
We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file > /dev/null`.
The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.
Two different read-ahead schemes are used:
-
With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.
-
With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.
There are a few different possibilities for the physical I/O:
-
Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.
-
A thread finds it needs a page and reads it.
-
Using Unix asynchronous I/O, aio.h, with the aio_* and lio_listio functions.
-
Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.
The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.
There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.
While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.
This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.
The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.
The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.
The database space allocation gives each index a number of 2MB segments, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range a the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store.
For the sake of simplicity we only run 7 Single with the 1000 Mt scale.
The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.
The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.
The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here, are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.
There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.
We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so.
Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want.
So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.
So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.
In all cases the 4 HDDs together showed about 700% busy, i.e., about 1.7 operations in flight per drive. With the new setting we get longer requests, most often full extents, which yields a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read as part of a read-ahead batch. We see the I/O throughput decrease during the run because more and more reads are single-page reads that do not trigger extent read-ahead: the HDDs still have 1.7 concurrent operations pending, but the batch size drops, and the throughput drops with it.
Thus, with the best settings, the test with 2000 query mixes finishes in 46 minutes, with CPU utilization steadily increasing and hitting 392% over the last minute. In comparison, with SSDs and our worst read-ahead setting, we got to 1200% CPU in under 5 minutes from cold start. The I/O system can be tuned further, for example by reading only full extents as long as the buffer pool is not full. In the next post we will measure some more.
BSBM Note
We look at query times with a semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad; Q5 runs at about half speed, and Q12 at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there; neither the queries nor the workload has any exploitable pattern. Therefore, if an I/O component is to be included in a BSBM metric, the only way to score well on it is to use speculative reading to the maximum.
Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but it would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise); multithreading for single rows is not worth the overhead.
A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Speculatively reading large chunks at startup is good, so a section in BSBM that forced one to implement this would be a service to most end users. Measuring and reporting such I/O performance would also favor space efficiency, which is generally a good thing, especially at larger scales; so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy: just speculatively read everything.
Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.
Benchmarks, Redux Series
- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs (this post)
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
|
03/07/2011 14:17 GMT
|
Modified:
03/14/2011 17:16 GMT
|
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
[
Orri Erling
]
Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso here. This can serve as a checklist for tuning pretty much any RDF database.
1. Threading - What settings should be used (e.g., for query parallelization; I/O parallelization [e.g., prefetch, flush of dirty pages]; thread pools [e.g., web server]; any other thread-related)? We will run with 8 and 32 cores, so if there are settings controlling the number of read/write (R/W) locks, mutexes, or the like used for serializing diverse things, these should be set accordingly to minimize contention.
The following three settings are all in the [Parameters] section of the virtuoso.ini file.
- AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better.
- ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better.
- IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.
A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.
In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. A value larger than the number of concurrent clients is OK; for large numbers of concurrent clients, a lower value may be better, in which case requests wait for a thread to become available.
Note: the [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads; with the closed-source Commercial Version, [Parameters] ServerThreads also cannot exceed the licensed thread count. A sketch of these settings together follows below.
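For example, a 16-core, 32-thread machine might start from something like the following; the numbers are illustrative starting points to be validated by measurement, not recommended values.
[Parameters]
ServerThreads        = 100   ; total thread pool; must cover the HTTP threads below
AsyncQueueMaxThreads = 24    ; 1.5 * 16 cores; try 48 for 1.5 * 32 core threads
ThreadsPerQuery      = 16    ; number of cores; try 32 for core threads

[HTTPServer]
ServerThreads        = 64    ; concurrent SPARQL protocol requests; must not exceed the [Parameters] pool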
2. File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.
It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.
In the [Parameters] section of the virtuoso.ini file, set FDsPerFile to (the number of concurrent threads * 1.5) ÷ the number of distinct database files; see the sketch below.
There are no SSD-specific settings.
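A sketch of a possible striping setup for 4 devices, each with its own I/O queue, plus the FDsPerFile arithmetic for 32 concurrent threads and 4 stripe files; the exact Segment syntax and queue naming should be checked against the TPC-C sample mentioned above.
[Database]
Striping = 1

[Striping]
; one stripe file per independently seekable device, each with its own I/O queue
Segment1 = 10G, /disk1/bsbm-1.db = iq1, /disk2/bsbm-2.db = iq2, /disk3/bsbm-3.db = iq3, /disk4/bsbm-4.db = iq4

[Parameters]
FDsPerFile = 12   ; (32 concurrent threads * 1.5) / 4 database files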
3. Loading - How many parallel streams work best? We are looking for a non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?
Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.
Use the built-in bulk load facility, i.e.,
ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');
For example,
SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');
Then run rdf_loader_run () on enough connections. For example, the shell command
isql rdf_loader_run () &
starts one in a background isql process. When starting background load commands from the shell, use the shell wait command to wait for completion; if starting from isql, use the wait_for_children; command (see the isql documentation for details). A fuller sketch follows below.
See the BSBM disclosure report for an example load script.
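Putting the pieces together, a load from the shell could look like the sketch below; the port and credentials (1111, dba/dba), the data directory, and the stream count of 8 are placeholders to be adapted.
isql 1111 dba dba exec="ld_dir ('/data/bsbm', '*.n3', 'http://bsbm.org');"
for i in 1 2 3 4 5 6 7 8     # one loader stream per core
do
    isql 1111 dba dba exec="rdf_loader_run ();" &
done
wait                                    # block until every loader has finished
isql 1111 dba dba exec="checkpoint;"    # make the load durable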
4. What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.
Execute
CHECKPOINT;
through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol.
5. What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll-forward log, unlike the case for the bulk load.
No special settings are needed; the defaults produce transactional behavior with a roll-forward log. Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with
DefaultIsolation = 4
Transaction isolation cannot be set over the SPARQL protocol.
NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.
6. What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.
In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.
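As a worked example, on a hypothetical 72 GB machine, two-thirds of RAM is 48 GB; at roughly 10KB of memory per buffered 8KB page, that is about 5,000,000 buffers:
[Parameters]
NumberOfBuffers = 5000000   ; ~48 GB of cached pages at ~10KB of memory each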
7. What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.), so that we can verify that things are indeed in server memory and not, for example, being served from the OS disk cache? If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for the database cache?
In an isql session, execute
STATUS ();
The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.
8. What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, and anything else associated with quads) and for dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for the text index.
Execute on an isql session:
CHECKPOINT;
SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, and RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for the text index.
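To roll these per-index figures up into the totals asked for, something like the following can be used; the iss_key_table column name, for the table an index belongs to, is an assumption to be checked against the view's actual columns.
-- sum 8KB pages per table, assuming a column iss_key_table names the owning table
SELECT iss_key_table, SUM (iss_pages) AS pages, SUM (iss_pages) * 8 / 1024 AS megabytes
  FROM sys_index_space_stats
 GROUP BY iss_key_table
 ORDER BY 2 DESC;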
9. If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph.
The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute
STATISTICS DB.DBA.RDF_QUAD;
10. For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index?
The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order.
11. For partitioned clusters, are there settings to control message batching or similar? What statistics are available for checking interconnect operation, e.g., message counts, latencies, total aggregate throughput of the interconnect?
In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful.
To change this on the fly, the following can be issued through an isql session:
cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );
The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.
STATUS ('cluster');      -- whole cluster
STATUS ('cluster_d');    -- process-by-process
12. Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?
- For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with
StopCompilerWhenXOverRun = 1
- When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with
DefaultIsolation = 2
Benchmarks, Redux Series
- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire (this post)
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
|
03/04/2011 15:28 GMT
|
Modified:
03/14/2011 17:55 GMT
|
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
[
Virtuso Data Space Bot
]
Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning.
-
Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e,.g. web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.
The following three settings are all in the [Parameters] section of the virtuoso.ini file.
-
AsyncQueueMaxThreads
controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better.
-
ThreadsPerQuery
is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better.
-
IndexTreeMaps
is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.
A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.
In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.
Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count.
-
File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.
It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.
in the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files.
There are no SSD specific settings.
-
Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?
Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.
Use the built-in bulk load facility, i.e.,
ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');
For example,
SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');
Then do a rdf_loader_run () on enough connections. For example, you can use the shell command
isql rdf_loader_run () &
to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details).
See the BSBM disclosure report for an example load script.
-
What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.
Execute
CHECKPOINT;
through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol.
-
What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.
No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with
DefaultIsolation = 4
Transaction isolation cannot be set over the SPARQL protocol.
NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.
-
What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.
In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.
-
What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?
In an isql session, execute
STATUS ();
The second paragraph of the result gives counts of total, used, and dirty buffers. If the used-buffer count is steady and below the total, and the disk-read count on the line below it does not increase, the system is running from memory. The cached format is the same as the on-disk format.
-
What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.
Execute on an isql session:
CHECKPOINT;
SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, and RDF_OBJ are for the dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for the text index.
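As a worked example with made-up figures: if the RDF_QUAD rows sum to 1,500,000 iss_pages, the quads occupy 1,500,000 × 8KB ≈ 12GB of used pages; the dictionary and text-index totals are computed the same way.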
-
If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph.
The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute
STATISTICS DB.DBA.RDF_QUAD;
-
For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index?
The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order.
-
For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?
In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful.
To change this on the fly, the following can be issued through an isql session:
cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );
The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.
STATUS ('cluster');    -- whole cluster
STATUS ('cluster_d');  -- process-by-process
-
Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?
For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with
StopCompilerWhenXOverRun = 1
-
When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to run queries with READ COMMITTED transaction isolation, which removes most lock contention. Transaction isolation cannot be adjusted via SPARQL. It can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with
DefaultIsolation = 2
Benchmarks, Redux Series
-
Benchmarks, Redux (part 1): On RDF Benchmarks
-
Benchmarks, Redux (part 2): A Benchmarking Story
-
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
-
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire (this post)
-
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
-
Benchmarks, Redux (part 6): BSBM and I/O, continued
-
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
-
Benchmarks, Redux (part 8): BSBM Explore and Update
-
Benchmarks, Redux (part 9): BSBM With Cluster
-
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
-
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
-
Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
Benchmarks, Redux (part 13): BSBM BI Modifications
-
Benchmarks, Redux (part 14): BSBM BI Mix
-
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
03/04/2011 15:28 GMT | Modified: 03/14/2011 17:56 GMT
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
[
Orri Erling
]
In this post I will summarize the figures for BSBM Load and Explore mixes at 100 Mt, 200 Mt, and 1000 Mt. (1 Mt = 1 Megatriple, or one million triples.) The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs. The exact specifications and configurations are in the raw reports to follow.
The load time in the recent Berlin report was measured with the wrong function and, so far as we can tell, without multiple threads. The intermediate cut of Virtuoso tested there also had broken SPARQL/Update (also known as SPARUL) support. We have since fixed this, and give the correct numbers here.
In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:
-
6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference.
-
6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso.
-
7 Single is the next generation single server Virtuoso, about to be released as a preview.
To understand the numbers, we must explain how these differ from each other in execution:
-
6 Single has one thread-per-query, and operates on one state of the query at a time.
-
6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.
-
7 Single has multiple threads per query, and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that an index lookup receives a large number of parameters at once; these are sorted into ascending order so that the search pattern benefits from locality, and the n·log(n) cost of the batch of index accesses becomes closer to linear when the data has any locality, since consecutive probes tend to land on the same pages. Furthermore, when an operator has many operands, these can be split across multiple threads; likewise, a scan of consecutive rows can be split up front into ranges, each handled by its own thread. These features are called vectored execution and query parallelization. They will be applied to the cluster variant in due time.
The version 6 and 7 variants discussed here use the same physical storage layout, with row-wise key compression. Version 7 additionally has a column-wise storage option that can fit 4x the number of quads in the same space; it is not used here because it still has some problems with random-order inserts.
We will first consider loading. Below are the load times and rates for 7 at each scale.
7 Single

Scale     Rate (quads per second)   Load time (seconds)   Checkpoint time (seconds)
100 Mt    261,366                   301                   82
200 Mt    216,000                   802                   123
1000 Mt   130,378                   6641                  1012
In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale.
We also loaded the smallest data set with 6 Single using the same load script.
6 Single

Scale     Rate (quads per second)   Load time (seconds)   Checkpoint time (seconds)
100 Mt    74,713                    1192                  145
CPU time with 6 Single was 8047 seconds, against 4453 seconds of CPU for the same load on 7 Single. CPU utilization during the run hovered around 700% for 6 Single and around 1300% for 7 Single. Note that these percentages count core threads, not real cores.
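In ratio terms, derived from these figures, 7 Single spends 8047 / 4453 ≈ 1.8x less CPU time on the same load.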
The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. 6 Single inserts transactionally, but it makes very frequent commits and writes no log; this gives de facto non-transactional behavior, yet still pays for a lock-and-commit cycle on every row. Inserts in an RDF load usually exhibit locality on all of S, P, O, and G. Sorting the batch by value gives an ascending insert order and eliminates much of the lookup time for deciding where the next row goes. Contention on page read-write locks also drops, because the engine stays longer on a page, inserting multiple values in one visit instead of re-acquiring the read-write lock, and possibly transaction locks, for each row.
Furthermore, for single-stream loading, the non-transactional mode lets one parsing thread feed many inserting threads; in practice, the speed is then bounded by parsing speed. In multi-stream load this parallelization also happens but matters less, as adding threads past the count of core threads is not useful. Writes are all in-place; no delta-merge mechanism is involved. For transactional inserts, uncommitted rows are not visible to READ COMMITTED readers, which do not block; REPEATABLE READ and SERIALIZABLE readers would block on an uncommitted insert.
Now for the run (larger numbers indicate more queries executed, and are therefore better):
6 Single Throughput (QMpH, query mixes per hour)

Scale     Single User   16 User
100 Mt    7641          29433
200 Mt    6017          13335
1000 Mt   1770          2487
7 Single Throughput (QMpH, query mixes per hour)

Scale     Single User   16 User
100 Mt    11742         72278
200 Mt    10225         60951
1000 Mt   6262          24672
The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.
Looking at single-user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, which touches more data and accounts for a relatively larger share of the mix as the scale grows. The 100 Mt run of 7 Single has about 130% CPU utilization, and the 1000 Mt run about 270%. This also explains why adding clients gives a larger boost at the smaller scale.
Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of single-user Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH, a bit worse than 6 Single; this is to be expected, since running vectored with single-element vectors carries some overhead. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.
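Expressed as ratios from these figures: parallelization buys 6132 / 2805 ≈ 2.2x on this short-query mix, vectoring a further 2805 / 1319 ≈ 2.1x, and for Q5 alone, single-thread vectoring is worth 1.9 / 0.8 ≈ 2.4x.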
The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.
In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.
The full run details will be supplied at the end of this blog series.
Benchmarks, Redux Series
-
Benchmarks, Redux (part 1): On RDF Benchmarks
-
Benchmarks, Redux (part 2): A Benchmarking Story
-
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore (this post)
-
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
-
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
-
Benchmarks, Redux (part 6): BSBM and I/O, continued
-
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
-
Benchmarks, Redux (part 8): BSBM Explore and Update
-
Benchmarks, Redux (part 9): BSBM With Cluster
-
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
-
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
-
Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
Benchmarks, Redux (part 13): BSBM BI Modifications
-
Benchmarks, Redux (part 14): BSBM BI Mix
-
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
03/02/2011 18:23 GMT | Modified: 03/14/2011 17:16 GMT