<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/">
  <rss:title>Orri Erling&#39;s Weblog</rss:title>
  <rss:link>http://www.openlinksw.com/weblog/oerling/</rss:link>
  <rss:description />
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">oerling@openlinksw.com</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2013-05-19T09:55:35Z</dc:date>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1683" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1667" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-04#1665" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-02#1663" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2010-09-13#1626" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2010-04-05#1618" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-08-14#1568" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-03-25#1537" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-03-24#1535" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-01-02#1510" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-11#1494" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1476" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1374" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-03-06#1321" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-02-04#1308" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1304" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-11-08#1269" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-11-08#1268" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-09-06#1250" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-08-28#1246" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-08-27#1244" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-05-23#1196" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-03-16#1159" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-02-05#1131" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-01-10#1116" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-12-22#1108" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-11-01#1074" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-07-18#1010" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-07-17#1007" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1683">
  <rss:title>Benchmarks, Redux (part 14): BSBM BI Mix</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:31:32Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post. Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted. Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit. Below we show samples of test driver output; the whole output is downloadable. 
100 Mt Single User bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill \ -ucf bsbm/usecases/businessIntelligence/sparql.txt \ -dg http://bsbm.org http://localhost:8604/sparql 0: 43348.14ms, total: 43440ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 43.3481s / 43.3481s Elapsed runtime: 43.348 seconds QMpH: 83.049 query mixes per hour CQET: 43.348 seconds average runtime of query mix CQET (geom.): 43.348 seconds geometric mean runtime of query mix AQET (geom.): 0.492 seconds geometric mean runtime of query Throughput: 1494.874 BSBM-BI throughput: qph*scale BI Power: 7309.820 BSBM-BI Power: qph*scale (geom) 100 Mt 8 User Thread 6: query mix 3: 195793.09ms, total: 196086.18ms Thread 8: query mix 0: 197843.84ms, total: 198010.50ms Thread 7: query mix 4: 201806.28ms, total: 201996.26ms Thread 2: query mix 5: 221983.93ms, total: 222105.96ms Thread 4: query mix 7: 225127.55ms, total: 225317.49ms Thread 3: query mix 6: 225860.49ms, total: 226050.17ms Thread 5: query mix 2: 230884.93ms, total: 231067.61ms Thread 1: query mix 1: 237836.61ms, total: 237959.11ms Benchmark run completed in 237.985427s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 195.7931s / 237.8366s Total runtime (sum): 1737.137 seconds Elapsed runtime: 1737.137 seconds QMpH: 121.016 query mixes per hour CQET: 217.142 seconds average runtime of query mix CQET (geom.): 216.603 seconds geometric mean runtime of query mix AQET (geom.): 2.156 seconds geometric mean runtime of query Throughput: 2178.285 BSBM-BI throughput: qph*scale BI Power: 1669.745 BSBM-BI Power: qph*scale (geom) 1000 Mt Single User 0: 608707.03ms, total: 608768ms Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on 
Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 608.7070s / 608.7070s Elapsed runtime: 608.707 seconds QMpH: 5.914 query mixes per hour CQET: 608.707 seconds average runtime of query mix CQET (geom.): 608.707 seconds geometric mean runtime of query mix AQET (geom.): 5.167 seconds geometric mean runtime of query Throughput: 1064.552 BSBM-BI throughput: qph*scale BI Power: 6967.325 BSBM-BI Power: qph*scale (geom) 1000 Mt 8 User bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill \ -ucf bsbm/usecases/businessIntelligence/sparql.txt \ -dg http://bsbm.org http://localhost:8604/sparql Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms Benchmark run completed in 2889.302566s Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 2211.2753s / 2889.1992s Total runtime (sum): 20481.895 seconds Elapsed runtime: 20481.895 seconds QMpH: 9.968 query mixes per hour CQET: 2560.237 seconds average runtime of query mix CQET (geom.): 2544.284 seconds geometric mean runtime of query mix AQET (geom.): 13.556 seconds geometric mean runtime of query Throughput: 1794.205 BSBM-BI throughput: qph*scale BI Power: 2655.678 BSBM-BI Power: qph*scale (geom) Metrics for Query: 1 Count: 8 times executed in whole run Time share 2.120884% of total execution time AQET: 54.299656 seconds (arithmetic mean) AQET(geom.): 34.607302 seconds (geometric mean) QPS: 0.13 Queries per 
second minQET/maxQET: 11.71547600s / 148.65379700s Metrics for Query: 2 Count: 8 times executed in whole run Time share 0.207382% of total execution time AQET: 5.309462 seconds (arithmetic mean) AQET(geom.): 2.737696 seconds (geometric mean) QPS: 1.34 Queries per second minQET/maxQET: 0.78729800s / 25.80948200s Metrics for Query: 3 Count: 8 times executed in whole run Time share 17.650472% of total execution time AQET: 451.893890 seconds (arithmetic mean) AQET(geom.): 410.481088 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 171.07262500s / 721.72939200s Metrics for Query: 5 Count: 32 times executed in whole run Time share 6.196565% of total execution time AQET: 39.661685 seconds (arithmetic mean) AQET(geom.): 6.849882 seconds (geometric mean) QPS: 0.18 Queries per second minQET/maxQET: 0.15696500s / 189.00906200s Metrics for Query: 6 Count: 8 times executed in whole run Time share 0.119916% of total execution time AQET: 3.070136 seconds (arithmetic mean) AQET(geom.): 2.056059 seconds (geometric mean) QPS: 2.31 Queries per second minQET/maxQET: 0.41524400s / 7.55655300s Metrics for Query: 7 Count: 40 times executed in whole run Time share 1.577963% of total execution time AQET: 8.079921 seconds (arithmetic mean) AQET(geom.): 1.342079 seconds (geometric mean) QPS: 0.88 Queries per second minQET/maxQET: 0.02205800s / 40.27761500s Metrics for Query: 8 Count: 40 times executed in whole run Time share 72.126818% of total execution time AQET: 369.323481 seconds (arithmetic mean) AQET(geom.): 114.431863 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 5.94377300s / 1824.57867400s The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. 
The difference between a warm single-user run and a cold single-user run is about 15% with data on SSD; with data on disk, the difference would be larger. The numbers shown are with warm cache. The difference in Throughput at 1000 Mt, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization. With these numbers, the CPU does not appear badly memory-bound, or else the increase would be smaller; core multi-threading also seems to bring some benefit. If the single-user run were at 800% CPU, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although not every query mix is exactly the same length, so the figure is not exact. At the very least, core multi-threading does not seem to hurt. Comparing the same numbers with the column store will be interesting, since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good. For reference, we show a run with Virtuoso 6 at 100 Mt. 
0: 424754.40ms, total: 424829ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 424.7544s / 424.7544s Elapsed runtime: 424.754 seconds QMpH: 8.475 query mixes per hour CQET: 424.754 seconds average runtime of query mix CQET (geom.): 424.754 seconds geometric mean runtime of query mix AQET (geom.): 1.097 seconds geometric mean runtime of query Throughput: 152.559 BSBM-BI throughput: qph*scale BI Power: 3281.150 BSBM-BI Power: qph*scale (geom) and 8 user Thread 5: query mix 3: 616997.86ms, total: 617042.83ms Thread 7: query mix 4: 625522.18ms, total: 625559.09ms Thread 3: query mix 7: 626247.62ms, total: 626304.96ms Thread 1: query mix 0: 629675.17ms, total: 629724.98ms Thread 4: query mix 6: 667633.36ms, total: 667670.07ms Thread 8: query mix 2: 674206.07ms, total: 674256.72ms Thread 6: query mix 5: 695020.21ms, total: 695052.29ms Thread 2: query mix 1: 701824.67ms, total: 701864.91ms Benchmark run completed in 701.909341s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 616.9979s / 701.8247s Total runtime (sum): 5237.127 seconds Elapsed runtime: 5237.127 seconds QMpH: 41.031 query mixes per hour CQET: 654.641 seconds average runtime of query mix CQET (geom.): 653.873 seconds geometric mean runtime of query mix AQET (geom.): 2.557 seconds geometric mean runtime of query Throughput: 738.557 BSBM-BI throughput: qph*scale BI Power: 1408.133 BSBM-BI Power: qph*scale (geom) Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput. 100 Mt: sqrt ( 7771 * 2178 ); = 4114 1000 Mt: sqrt ( 6967 * 1794 ); = 3535 Scaling seems to work; the results are in the same general ballpark. 
The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable. Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example. Also there should be more queries. At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people. Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an IN into a JOIN with the IN subquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN which besides must be partitioned at the larger scales. Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. 
Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point. In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended. BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonable JOIN order since we always reorder the query regardless of the source formulation. BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though. There are places in BSBM-BI where using a HASH JOIN makes sense. We do not use HASH JOINs with RDF because there is an index for everything and making a HASH JOIN in the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like &quot;count the hairballs made in Germany reviewed by Japanese Hello Kitty fans,&quot; where two ends of a JOIN path are fairly selective doing the other as a HASH JOIN is good. This can, if the JOIN is always cardinality-reducing, even be merged inside an INDEX lookup. 
We have such capabilities since we have been for a while gearing up for the relational races, but are not using any of these with BSBM-BI, although they would be useful. Let us see the profile for a single user 100 Mt run. The database activity summary is -- select db_activity (0, &#39;http&#39;); 161.3M rnd 210.2M seq 0 same seg 104.5M same pg 45.08M same par 0 disk 0 spec disk 0B / 0 messages 2.393K fork See the post &quot;What Does BSBM Explore Measure&quot; for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get -- 172.4M rnd 220.8M seq 0 same seg 149.6M same pg 10.99M same par 21 disk 861 spec disk 0B / 0 messages 754 fork The throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains. Let us now go back to the first run with vector size 10000. 
The top of the CPU oprofile is as follows: 722309 15.4507 cmpf_iri64n_iri64n 434791 9.3005 cmpf_iri64n_iri64n_anyn_iri64n 294712 6.3041 itc_next_set 273488 5.8501 itc_vec_split_search 203970 4.3631 itc_dive_transit 199687 4.2714 itc_page_rcf_search 181614 3.8848 dc_itc_append_any 173043 3.7015 itc_bm_vec_row_check 146727 3.1386 cmpf_int64n 128224 2.7428 itc_vec_row_check 113515 2.4282 dk_alloc 97296 2.0812 page_wait_access 62523 1.3374 qst_vec_get_int64 59014 1.2623 itc_next_set_parent 53589 1.1463 sslr_qst_get 48003 1.0268 ds_add 46641 0.9977 dk_free_tree 44551 0.9530 kc_var_col 43650 0.9337 page_col_cmp_1 35297 0.7550 cmpf_iri64n_iri64n_anyn_gt_lt 34589 0.7399 dv_compare 25864 0.5532 cmpf_iri64n_anyn_iri64n_iri64n_lte 23088 0.4939 dk_free The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with P and S given. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns. A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set. Some code sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good. TPC-H has such in Q21 and Q19. An OR with existences where there would be gain from good guesses of a subquery&#39;s selectivity would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like the CASE-WHEN in TPC-H Q12. 
We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more. This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): The Substance of Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM-BI Modifications Benchmarks, Redux (part 14): BSBM-BI Mix (this post) Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In this post, we look at how we run the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x23be8d28">BSBM</a>-BI mix.  We consider the 100 Mt and 1000 Mt scales with <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x23b69e40">Virtuoso</a> 7 using the same hardware and software as in the previous posts.  The changes to workload and metric are given in the previous post.</p>

<p>Our intent here is to look at whether the metric works, and to see what results will look like in general.  We are as much testing the benchmark as we are testing the system-under-test (SUT).  The results shown here will likely not be comparable with future ones, because we will most likely change the composition of the workload, which seems a bit out of balance.  Anyway, for the sake of disclosure, we attach the query templates.  The test driver we used will be made available soon, so those interested may still try a comparison with their own systems. If you practice with this workload for the coming races, the effort will surely not be wasted.</p>


<p>Once we have come up with a rules document, we will redo everything we have published so far by the book, and have it audited as part of the <a class="auto-href" href="http://lod2.eu/" id="link-id0x23a74c40">LOD2</a> service we plan for this (see previous posts in this series).  This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.</p>

<p>Below we show samples of test driver output; the whole output is <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/br.tar.gz" id="link-id0x1b703ad8">downloadable</a>.</p>

<p>100 Mt Single User</p>

<blockquote>
 <code><pre>
bsbm/testdriver   -runs 1   -w 0 -idir /bs/1  -drill  \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt  \
   -dg http://bsbm.org http://localhost:8604/sparql
</pre>
 </code>
</blockquote>

<blockquote>
 <code><pre>
0: 43348.14ms, total: 43440ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    43.3481s / 43.3481s
Elapsed runtime:        43.348 seconds
QMpH:                   83.049 query mixes per hour
CQET:                   43.348 seconds average runtime of query mix
CQET (geom.):           43.348 seconds geometric mean runtime of query mix
AQET (geom.):           0.492 seconds geometric mean runtime of query
Throughput:             1494.874 BSBM-BI throughput: qph*scale
BI Power:               7309.820 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>
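<p>As a rough cross-check of the summary lines above, the sketch below recomputes QMpH, Throughput, and BI Power from the raw timings. The factor of 18 queries per mix and the scale of 1 for 100 Mt are not printed by the driver for this run; they are assumptions inferred from the per-query counts of the 8-mix run shown further down and from the numbers themselves. The variable names are ours, not the driver's.</p>

```python
# Figures copied from the 100 Mt single-user summary.
mix_runs = 1
elapsed_s = 43.34814      # from the line "0: 43348.14ms"
aqet_geom_s = 0.492       # geometric mean query time, as printed (rounded)

# Assumptions, not stated in the driver output for this run:
queries_per_mix = 18      # inferred from per-query counts of an 8-mix run
scale = 1                 # 100 Mt taken as scale 1; 1000 Mt would then be 10

# Query mixes per hour, from wall-clock time.
qmph = 3600.0 * mix_runs / elapsed_s

# "qph*scale": queries per hour times the scale.
throughput = qmph * queries_per_mix * scale

# "qph*scale (geom)": queries per hour implied by the geometric mean time.
power = 3600.0 / aqet_geom_s * scale

print(f"QMpH:       {qmph:.3f}")
print(f"Throughput: {throughput:.3f}")
print(f"BI Power:   {power:.1f}")
```

<p>Under these assumptions, QMpH and Throughput agree with the printed values to within rounding. BI Power comes out about 0.1% higher than the printed figure, presumably because the printed AQET (geom.) is itself rounded to three decimals.</p>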



<p>100 Mt 8 User </p>

<blockquote>
 <code><pre>
Thread 6: query mix 3: 195793.09ms, total: 196086.18ms
Thread 8: query mix 0: 197843.84ms, total: 198010.50ms
Thread 7: query mix 4: 201806.28ms, total: 201996.26ms
Thread 2: query mix 5: 221983.93ms, total: 222105.96ms
Thread 4: query mix 7: 225127.55ms, total: 225317.49ms
Thread 3: query mix 6: 225860.49ms, total: 226050.17ms
Thread 5: query mix 2: 230884.93ms, total: 231067.61ms
Thread 1: query mix 1: 237836.61ms, total: 237959.11ms
Benchmark run completed in 237.985427s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    195.7931s / 237.8366s
Total runtime (sum):    1737.137 seconds
Elapsed runtime:        1737.137 seconds
QMpH:                   121.016 query mixes per hour
CQET:                   217.142 seconds average runtime of query mix
CQET (geom.):           216.603 seconds geometric mean runtime of query mix
AQET (geom.):           2.156 seconds geometric mean runtime of query
Throughput:             2178.285 BSBM-BI throughput: qph*scale
BI Power:               1669.745 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>
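<p>The multi-user summary mixes two kinds of time: the sum of the per-thread mix times and the wall-clock time of the run. The sketch below, using only figures from the output above, suggests that the "Elapsed runtime" line repeats the sum, while QMpH is consistent with the wall-clock time. This is our reading of the numbers, not a statement about the driver's source code.</p>

```python
import math

# Query mix times in seconds, from the eight "Thread N" lines.
mix_times_s = [195.79309, 197.84384, 201.80628, 221.98393,
               225.12755, 225.86049, 230.88493, 237.83661]
wall_clock_s = 237.985427   # "Benchmark run completed in ..."

n = len(mix_times_s)
total_s = sum(mix_times_s)                      # "Total runtime (sum)"
cqet = total_s / n                              # arithmetic mean mix time
cqet_geom = math.exp(sum(math.log(t) for t in mix_times_s) / n)
qmph = 3600.0 * n / wall_clock_s                # mixes per hour of wall-clock time

print(f"Total runtime (sum): {total_s:.3f} s")
print(f"CQET:                {cqet:.3f} s")
print(f"CQET (geom.):        {cqet_geom:.3f} s")
print(f"QMpH:                {qmph:.3f}")
```

<p>Dividing by the summed time instead of the wall-clock time would give a QMpH roughly seven times lower, so the distinction matters when comparing single-user and multi-user figures.</p>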


<p>1000 Mt Single User</p>

<blockquote>
 <code><pre>
0: 608707.03ms, total: 608768ms

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    608.7070s / 608.7070s
Elapsed runtime:        608.707 seconds
QMpH:                   5.914 query mixes per hour
CQET:                   608.707 seconds average runtime of query mix
CQET (geom.):           608.707 seconds geometric mean runtime of query mix
AQET (geom.):           5.167 seconds geometric mean runtime of query
Throughput:             1064.552 BSBM-BI throughput: qph*scale
BI Power:               6967.325 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>


<p>1000 Mt 8 User </p>

<blockquote>
 <code><pre>
bsbm/testdriver   -runs 8 -mt 8  -w 0 -idir /bs/10  -drill  \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt   \
   -dg http://bsbm.org http://localhost:8604/sparql
</pre>
 </code>
</blockquote>

<blockquote>
 <code><pre>
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    2211.2753s / 2889.1992s
Total runtime (sum):    20481.895 seconds
Elapsed runtime:        20481.895 seconds
QMpH:                   9.968 query mixes per hour
CQET:                   2560.237 seconds average runtime of query mix
CQET (geom.):           2544.284 seconds geometric mean runtime of query mix
AQET (geom.):           13.556 seconds geometric mean runtime of query
Throughput:             1794.205 BSBM-BI throughput: qph*scale
BI Power:               2655.678 BSBM-BI Power: qph*scale (geom)

Metrics for Query:      1
Count:                  8 times executed in whole run
Time share              2.120884% of total execution time
AQET:                   54.299656 seconds (arithmetic mean)
AQET(geom.):            34.607302 seconds (geometric mean)
QPS:                    0.13 Queries per second
minQET/maxQET:          11.71547600s / 148.65379700s

Metrics for Query:      2
Count:                  8 times executed in whole run
Time share              0.207382% of total execution time
AQET:                   5.309462 seconds (arithmetic mean)
AQET(geom.):            2.737696 seconds (geometric mean)
QPS:                    1.34 Queries per second
minQET/maxQET:          0.78729800s / 25.80948200s

Metrics for Query:      3
Count:                  8 times executed in whole run
Time share              17.650472% of total execution time
AQET:                   451.893890 seconds (arithmetic mean)
AQET(geom.):            410.481088 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          171.07262500s / 721.72939200s

Metrics for Query:      5
Count:                  32 times executed in whole run
Time share              6.196565% of total execution time
AQET:                   39.661685 seconds (arithmetic mean)
AQET(geom.):            6.849882 seconds (geometric mean)
QPS:                    0.18 Queries per second
minQET/maxQET:          0.15696500s / 189.00906200s

Metrics for Query:      6
Count:                  8 times executed in whole run
Time share              0.119916% of total execution time
AQET:                   3.070136 seconds (arithmetic mean)
AQET(geom.):            2.056059 seconds (geometric mean)
QPS:                    2.31 Queries per second
minQET/maxQET:          0.41524400s / 7.55655300s

Metrics for Query:      7
Count:                  40 times executed in whole run
Time share              1.577963% of total execution time
AQET:                   8.079921 seconds (arithmetic mean)
AQET(geom.):            1.342079 seconds (geometric mean)
QPS:                    0.88 Queries per second
minQET/maxQET:          0.02205800s / 40.27761500s

Metrics for Query:      8
Count:                  40 times executed in whole run
Time share              72.126818% of total execution time
AQET:                   369.323481 seconds (arithmetic mean)
AQET(geom.):            114.431863 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          5.94377300s / 1824.57867400s
</pre>
 </code>
</blockquote>



<p>The <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x249ce740">CPU</a> for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x2871b1f0">data</a> to work on.  But final <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x22c95b90">optimization</a> is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more.  The numbers shown are with warm <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x22ca4300">cache</a>.  The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.</p>

<p>With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit.  If the single-user run was at 800%, the Throughput would be 1488.  The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact.  Core multi-threading does not seem to hurt, at the very least.  Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The <a class="auto-href" href="http://dbpedia.org/resource/Intel_Corporation" id="link-id0x28814950">Intel</a> Nehalem memory subsystem is really pretty good.</p>
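<p>The extrapolation to 800% can be reproduced with simple proportional scaling. This is a minimal sketch, assuming that Throughput grows linearly with CPU utilization; the function name is hypothetical. With the rounded figures quoted in the text it yields about 1483, close to the 1488 given above; the small gap presumably comes from rounding in the quoted CPU percentage.</p>

```python
# Linear extrapolation of single-user Throughput to a target CPU level.
# Assumption: Throughput scales in proportion to CPU utilization.

def extrapolate_throughput(throughput, cpu_pct, target_cpu_pct):
    """Scale a measured throughput to a hypothetical CPU utilization."""
    return throughput * target_cpu_pct / cpu_pct

single_user = 1064.0      # 1000 Mt single-user Throughput, from the text
single_user_cpu = 574.0   # CPU % of that run, from the text
multi_user = 1794.0       # 1000 Mt multi-user Throughput, from the text

at_800 = extrapolate_throughput(single_user, single_user_cpu, 800.0)
print(round(at_800))                  # estimate for 8 fully busy threads
print(round(multi_user / at_800, 2))  # multi-user speed relative to that estimate
```

<p>The second figure is the part of the multi-user speed that the 800% estimate does not explain, which is what the text attributes to core multi-threading.</p>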
<p>For reference, we show a single-user run with Virtuoso 6 at 100 Mt. </p>

<blockquote>
 <code><pre>
0: 424754.40ms, total: 424829ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    424.7544s / 424.7544s
Elapsed runtime:        424.754 seconds
QMpH:                   8.475 query mixes per hour
CQET:                   424.754 seconds average runtime of query mix
CQET (geom.):           424.754 seconds geometric mean runtime of query mix
AQET (geom.):           1.097 seconds geometric mean runtime of query
Throughput:             152.559 BSBM-BI throughput: qph*scale
BI Power:               3281.150 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>


<p>And the same with 8 users: </p>

<blockquote>
 <code><pre>
Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    616.9979s / 701.8247s
Total runtime (sum):    5237.127 seconds
Elapsed runtime:        5237.127 seconds
QMpH:                   41.031 query mixes per hour
CQET:                   654.641 seconds average runtime of query mix
CQET (geom.):           653.873 seconds geometric mean runtime of query mix
AQET (geom.):           2.557 seconds geometric mean runtime of query
Throughput:             738.557 BSBM-BI throughput: qph*scale
BI Power:               1408.133 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>
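<p>The summary lines of the two Virtuoso 6 reports can be cross-checked from the raw times. This is a sketch under stated assumptions: QMpH is taken as query mixes per hour of wall-clock time, BI Power as 3600 divided by the geometric mean query time, and the scale multiplier is assumed to be 1 at 100 Mt. These formulas are inferred from the labels in the output, not taken from the test driver source, and the function names are hypothetical.</p>

```python
def qmph(mixes, elapsed_s):
    """Query mixes per hour of wall-clock time."""
    return mixes * 3600.0 / elapsed_s

def bi_power(aqet_geom_s, scale=1.0):
    """Queries per hour implied by the geometric mean query time, times scale.
    scale=1.0 for 100 Mt is an assumption."""
    return 3600.0 / aqet_geom_s * scale

# (number of mixes, wall-clock seconds, AQET (geom.) seconds) from the reports
runs = {
    "1 user": (1, 424.7544, 1.097),
    "8 users": (8, 701.909341, 2.557),
}

for name, (mixes, elapsed, aqet) in runs.items():
    print(name, round(qmph(mixes, elapsed), 3), round(bi_power(aqet), 1))
```

<p>The QMpH values agree with the reports. The Power values agree to within rounding of the printed AQET. Note that the 8-user QMpH matches the 701.9 second completion time, not the 5237.127 seconds printed as elapsed runtime, which is the sum over clients.</p>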




<p>Having the numbers, let us look at the metric and its scaling.  We take the geometric mean of the single-user Power and the multiuser Throughput.</p>


<blockquote>
 <code><pre>
 100 Mt: sqrt ( 7771 * 2178 ) = 4114

1000 Mt: sqrt ( 6967 * 1794 ) = 3535
</pre>
 </code>
</blockquote>
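<p>The composite metric above is easy to recompute. A minimal sketch follows; the function name is hypothetical and the inputs are the single-user Power and multi-user Throughput figures quoted above.</p>

```python
import math

def composite(power_single, throughput_multi):
    """Geometric mean of single-user Power and multi-user Throughput."""
    return math.sqrt(power_single * throughput_multi)

# (single-user Power, multi-user Throughput) per scale, from the text
results = {"100 Mt": (7771, 2178), "1000 Mt": (6967, 1794)}

for scale, (power, throughput) in results.items():
    print(scale, round(composite(power, throughput)))
```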


<p>Scaling seems to work; the results are in the same general ballpark.  The real times for the 1000 Mt run are a bit over 10x the times for the 100 Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone taking 77% and 72% of the time, respectively. The Q8 drill-down starts at the root of the product hierarchy.  If we made it start one level from the top, its share would drop.  This seems reasonable.</p>

<p>Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.</p>

<p>Also there should be more queries.</p>

<p>At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety.  We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin.  So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL.  On the other hand, BSBM-BI is not very good as a benchmark; <a class="auto-href" href="http://www.tpc.org/" id="link-id0x23227ce0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x279c6700">H</a> is a lot better.  This stands to reason, as TPC-H has had years and years of development and participation by many people.</p>

<p>Benchmark queries are trick questions. For example, TPC-H Q18 cannot be done without changing an <code>IN</code> into a <code>JOIN</code> with the <code>IN</code> subquery in the outer loop and doing streaming aggregation.  Q13 cannot be done without a well-optimized <code><a class="auto-href" href="http://dbpedia.org/resource/Hash_join" id="link-id0x238cbf88">HASH JOIN</a></code>, which must moreover be partitioned at the larger scales.</p>

<p>Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for.  Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.</p>


<p>In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.</p>

<p>BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a <code>JOIN</code> and the cardinality of a <code>GROUP BY</code>; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed.  I did however have to add some cardinality statistics to get reasonable <code>JOIN</code> order since we always reorder the query regardless of the source formulation.</p>
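<p>To illustrate how little the cost model has to know, here is a sketch of a greedy left-deep <code>JOIN</code> ordering that only needs to tell smaller from greater. It is not Virtuoso's optimizer; the <code>Pattern</code> class, the pattern names, and all cardinality and fanout numbers are hypothetical.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    name: str
    variables: frozenset
    cardinality: float  # estimated rows if this pattern is evaluated first
    fanout: float       # estimated rows per input row when joined by index

def order_joins(patterns):
    """Start from the smallest input, then keep adding the connected
    pattern with the lowest estimated fanout."""
    remaining = list(patterns)
    first = min(remaining, key=lambda p: p.cardinality)
    plan, bound = [first], set(first.variables)
    remaining.remove(first)
    while remaining:
        connected = [p for p in remaining if p.variables & bound]
        # Fall back to a cross product only if nothing shares a variable.
        candidates = connected or remaining
        nxt = min(candidates, key=lambda p: p.fanout)
        plan.append(nxt)
        bound |= nxt.variables
        remaining.remove(nxt)
    return [p.name for p in plan]

# Made-up statistics, for illustration only.
patterns = [
    Pattern("review_of_product", frozenset({"review", "product"}), 2.8e6, 10.0),
    Pattern("product_type", frozenset({"product", "type"}), 5.0e4, 1.0),
    Pattern("review_rating", frozenset({"review", "rating"}), 2.8e6, 0.7),
]
print(order_joins(patterns))
```

<p>The estimates only have to rank the alternatives correctly; halving or doubling any of the numbers above leaves the resulting order unchanged.</p>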

<p>BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different <code>JOIN</code> orders for different parameter values.  I have not looked into whether this really makes a difference, though.</p>

<p>There are places in BSBM-BI where using a <code>HASH JOIN</code> makes sense.  We do not use <code>HASH JOINs</code> with <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x235d8d88">RDF</a> because there is an index for everything and making a <code>HASH JOIN</code> in the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not do <code>HASH JOINs</code>.  This said, a <code>HASH JOIN</code> in the right place is a lot better than an index lookup.  With TPC-H Q13, our best <code>HASH JOIN</code> is over 2x better than the best <code>INDEX</code>-based <code>JOIN</code>, both being well tuned.  For questions like &quot;count the hairballs made in <a class="auto-href" href="http://dbpedia.org/resource/Germany" id="link-id0x2358ae60">Germany</a> reviewed by Japanese Hello Kitty fans,&quot; where the two ends of a <code>JOIN</code> path are both fairly selective, starting from one end and doing the other as a <code>HASH JOIN</code> is good.  This can, if the <code>JOIN</code> is always cardinality-reducing, even be merged inside an <code>INDEX</code> lookup.  We have such capabilities since we have been gearing up for the relational races for a while, but we are not using any of these with BSBM-BI, although they would be useful.</p>
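<p>For readers who want the shape of the technique, the following is a generic build-and-probe <code>HASH JOIN</code> sketch, not Virtuoso's implementation. The selective side is hashed once, which is the up-front cost mentioned above, and the other side is streamed past it. The row layout and all names are hypothetical.</p>

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Hash the (smaller, selective) build side, then stream the probe side."""
    table = defaultdict(list)
    for row in build_rows:               # up-front cost: one pass over the build side
        table[row[build_key]].append(row)
    for row in probe_rows:               # one hash lookup per probe row, no index dive
        for match in table.get(row[probe_key], ()):
            yield {**match, **row}

# Hypothetical data: products made in Germany, reviews by Japanese reviewers.
products = [
    {"product": "p1", "country": "DE"},
    {"product": "p2", "country": "DE"},
]
reviews = [
    {"review": "r1", "product": "p1", "reviewer_country": "JP"},
    {"review": "r2", "product": "p3", "reviewer_country": "JP"},
    {"review": "r3", "product": "p2", "reviewer_country": "JP"},
]

joined = list(hash_join(products, reviews, "product", "product"))
print(len(joined))
```

<p>If the build side turns out to be large because the cost model guessed wrong, the whole table is still built before the first result appears, which is why a misplaced <code>HASH JOIN</code> is expensive.</p>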
 

<p>Let us see the profile for a single user 100 Mt run.</p>

<p>The database activity summary is --</p>

<p>
<code>select db_activity (0, &#39;http&#39;);</code>
</p>

<p>
<code> 161.3M rnd  210.2M seq  0 same seg  104.5M same pg  45.08M same par  0 disk  0 spec disk  0B / 0 messages  2.393K fork</code>
</p>


<p>See the post &quot;<a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1b1f3068">What Does BSBM Explore Measure</a>&quot; for an explanation of the numbers.  We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality.  Running with a longer vector size would probably increase performance by getting better locality.  There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --</p>

<p>
<code> 172.4M rnd  220.8M seq  0 same seg  149.6M same pg  10.99M same par  21 disk  861 spec disk  0B / 0 messages  754 fork</code>
</p>


<p>The throughput goes from 1494 to 1779.  We see more hits on the same page, as expected.  We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to tune the self-adjust logic correctly, and then we again have clear gains.</p>
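<p>The two activity summaries can be turned into locality shares. This sketch assumes, following the reading given above, that the same-page and same-parent counts are subsets of the random lookups; the function name is hypothetical.</p>

```python
def locality(rnd_m, same_pg_m, same_par_m):
    """Fractions of random lookups that hit the same page or the same parent."""
    return same_pg_m / rnd_m, same_par_m / rnd_m

# (rnd, same pg, same par) in millions, from the two db_activity lines
runs = {
    "vector 10000": (161.3, 104.5, 45.08),
    "vector 100000": (172.4, 149.6, 10.99),
}

for name, counts in runs.items():
    same_pg, same_par = locality(*counts)
    print(f"{name}: same page {same_pg:.1%}, same parent {same_par:.1%}")

# Throughput figures from the text.
print(f"throughput gain: {1779 / 1494 - 1:.1%}")
```

<p>The same-page share rises from roughly two thirds to well over four fifths of the random lookups, and Throughput improves by about a fifth.</p>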

<p>Let us now go back to the first run with vector size 10000.</p>

<p>The top of the CPU <code>oprofile</code> is as follows:</p>

<blockquote>
 <code><pre>
722309   15.4507  cmpf_iri64n_iri64n
434791    9.3005  cmpf_iri64n_iri64n_anyn_iri64n
294712    6.3041  itc_next_set
273488    5.8501  itc_vec_split_search
203970    4.3631  itc_dive_transit
199687    4.2714  itc_page_rcf_search
181614    3.8848  dc_itc_append_any
173043    3.7015  itc_bm_vec_row_check
146727    3.1386  cmpf_int64n
128224    2.7428  itc_vec_row_check
113515    2.4282  dk_alloc
97296     2.0812  page_wait_access
62523     1.3374  qst_vec_get_int64
59014     1.2623  itc_next_set_parent
53589     1.1463  sslr_qst_get
48003     1.0268  ds_add
46641     0.9977  dk_free_tree
44551     0.9530  kc_var_col
43650     0.9337  page_col_cmp_1
35297     0.7550  cmpf_iri64n_iri64n_anyn_gt_lt
34589     0.7399  dv_compare
25864     0.5532  cmpf_iri64n_anyn_iri64n_iri64n_lte
23088     0.4939  dk_free
</pre>
 </code>
</blockquote>

<p>The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with <code>P</code> and <code>S</code> given.  The one after that is with all parts given, corresponding to an existence test.  The existence tests could probably be converted to <code>HASH JOIN</code> lookups to good advantage.  Aggregation and arithmetic are absent.  We should probably add a query like TPC-H Q1 that does nothing but these two.  Considering the overall profile, <code>GROUP BY</code> seems to be around 3%.  We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.</p>
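<p>Streaming aggregation, as mentioned above, can be sketched in a few lines. This is a generic illustration, not Virtuoso code: because the input arrives already grouped by the grouping column, only one group is held in memory at a time and no hash table of groups is needed. The data is made up.</p>

```python
from itertools import groupby
from operator import itemgetter

def streaming_count(rows, key):
    """Count rows per group; rows must arrive already grouped by `key`."""
    for group_key, group in groupby(rows, key=key):
        yield group_key, sum(1 for _ in group)

# Hypothetical (product, feature) rows, already ordered by product.
rows = [
    ("p1", "f1"), ("p1", "f2"),
    ("p2", "f1"),
    ("p3", "f1"), ("p3", "f3"), ("p3", "f4"),
]
print(list(streaming_count(rows, itemgetter(0))))
```

<p>If the input were not grouped, the same group key would be emitted more than once, so the technique only applies when the plan guarantees the order.</p>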

<p>A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set.  Some code sections in the queries with conditional execution and costly tests inside <code>ANDs</code> and <code>ORs</code> would be good.  TPC-H has such in Q21 and Q19.  An <code>OR</code> with existences where there would be gain from good guesses of a subquery&#39;s selectivity would be appropriate.  Also, there should be conditional expressions somewhere with a lot of data, like the <code>CASE-WHEN</code> in TPC-H Q12.</p>

<p>We can make BSBM-BI more interesting by putting in the above.  Also we will have to see where we can profit from <code>HASH JOIN</code>, both small and large.  There should be such places in the workload already so this is a matter of just playing a bit more.</p>

<p>This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1fd1d4e0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d5b07d8">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1dfe6c48">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x197fce30">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1fbf4210">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1beeb1e0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1d7e1818">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1dfc1730">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1ea819a8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1ec73da0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1fbdce90">Benchmarks, Redux (part 11): The Substance of Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x19928618">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f3d8710">Benchmarks, Redux (part 13): BSBM-BI Modifications </a>
</li>
<li>
Benchmarks, Redux (part 14): BSBM-BI Mix  <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e627400">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1667">
  <rss:title>Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-07T19:17:36Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by BSBM. There are two approaches: run twice or otherwise make sure one runs from memory and forget about I/O, or make rules and metrics for warm-up. We will see if the second is possible with BSBM. From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference. In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays. Storage Arrays Type Quantity Maker Size Speed Interface speed Controller Drive Cache RAID SSD 4 Crucial 128 GB N/A 6Gbit SATA RocketRaid 640 128 MB None HDD 4 Samsung 1000 GB 7200 RPM 3Gbit SATA Intel ICH on Supermicro motherboard 16 MB None We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file &gt; /dev/null`. The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread. Two different read-ahead schemes are used: With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read. With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed. 
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM. There are a few different possibilities for the physical I/O: Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations. A thread finds it needs a page and reads it. Using Unix asynchronous I/O, aio.h, with the aio_* and lio_listio functions. Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency. The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads. There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H. While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure. 
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question. This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read. The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher. The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here. The database space allocation gives each index a number of 2MB segments, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store. For the sake of simplicity we only run 7 Single with the 1000 Mt scale. The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds. The next test was with HDDs and extent read-ahead. 
One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing. The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here, are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion. There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput. We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so. Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want. So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see. So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first. The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. 
With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput. Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more. BSBM Note We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum. Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead. A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. 
Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything. Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs (this post) Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2a0452f8">BSBM</a>. </p>

<p>There are two approaches:</p> 

<ol>
<li>
  <p>run twice or otherwise make sure one runs from memory and forget about I/O, or</p>
</li>
<li>
  <p>make rules and metrics for warm-up.</p>
</li>
</ol>
<p>We will see if the second is possible with BSBM.</p>

<p>From this starting point, we look at various ways of scheduling I/O in <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x2a9fdb88">Virtuoso</a> using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference. </p>


<p>In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="9" align="center">Storage Arrays</th>
	</tr>
	<tr>
		<th align="center"> Type </th>
		<th align="center"> Quantity </th>
		<th align="center"> Maker </th>
		<th align="center"> Size </th>
		<th align="center"> Speed </th>
		<th align="center"> Interface speed </th>
		<th align="center"> Controller </th>
		<th align="center"> Drive <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x2ad20cd0">Cache</a> </th>
		<th align="center"> RAID </th>
	</tr>
	<tr>
		<td align="center"> SSD </td>
		<td align="center"> 4 </td>
		<td align="center"> Crucial </td>
		<td align="center"> 128 GB </td>
		<td align="center"> N/A </td>
		<td align="center"> 6Gbit SATA </td>
		<td align="center"> RocketRaid 640 </td>
		<td align="center"> 128 MB </td>
		<td align="center"> None </td>
	</tr>
	<tr>
		<td align="center"> HDD </td>
		<td align="center"> 4 </td>
		<td align="center"> Samsung </td>
		<td align="center"> 1000 GB </td>
		<td align="center"> 7200 RPM </td>
		<td align="center"> 3Gbit SATA </td>
		<td align="center"> <a class="auto-href" href="http://dbpedia.org/resource/Intel_Corporation" id="link-id0x2a8cfca0">Intel</a> ICH on Supermicro motherboard </td>
		<td align="center"> 16 MB </td>
		<td align="center"> None </td>
	</tr>
</table>


<p>We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with <code>cat file &gt; /dev/null</code>. </p>

<p>The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.</p>

<p>Two different read-ahead schemes are used: </p>
<ul>
 <li>
  <p>With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.</p>
 </li>
<li>
  <p>With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.</p>
</li>
</ul>
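<p>The vectored scheme described for 7 Single can be sketched as follows. This is an illustration of the idea only, not Virtuoso's page format: an index node is modeled as a sorted list of separator keys, where child <code>i</code> is assumed to cover the keys from <code>separators[i-1]</code> up to, but not including, <code>separators[i]</code>. All names are hypothetical.</p>

```python
from bisect import bisect_right

def subtrees_to_read(separators, sorted_keys):
    """Return the child subtrees touched by a sorted batch of lookup keys.

    Assumed layout: child i covers [separators[i-1], separators[i]).
    Because the keys are sorted, each child is reported at most once.
    """
    needed = []
    for key in sorted_keys:
        child = bisect_right(separators, key)
        if not needed or needed[-1] != child:
            needed.append(child)
    return needed

separators = [100, 200, 300, 400]      # a node with 5 children
keys = [5, 17, 120, 130, 410, 999]     # one vector of lookup values, ascending
print(subtrees_to_read(separators, keys))
```

<p>Children 2 and 3 are never scheduled, which is the difference from extent-based read-ahead, where everything physically adjacent is read whether or not it will be used.</p>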

<p>In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.</p>

<p>There are a few different possibilities for the physical I/O: </p>

<ol>
<li>
  <p>Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.</p>
</li>
<li>
  <p>A thread finds it needs a page and reads it.</p>
</li>
<li>
  <p>Using Unix asynchronous I/O, <code>aio.h</code>, with the <code>aio_*</code> and <code>lio_listio</code> functions.</p>
</li>
<li>
  <p>Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.</p>
</li>
</ol>

<p>The latter two apply only to bulk I/O that is scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set).  These bulk reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.</p>
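<p>As a rough sketch of the elevator model, assuming page numbers stand in for disk positions and that the sweep wraps back to the lowest pending page after reaching the end (the wrap-around and the class name are assumptions, not taken from the Virtuoso source), the queue could look like this. Locking is left out for brevity; a real queue shared between threads would need it.</p>

```python
import bisect

class Elevator:
    """One-directional elevator over page numbers (illustrative sketch only)."""

    def __init__(self):
        self.pending = []  # sorted page numbers waiting for I/O
        self.head = -1     # last page served

    def submit(self, page):
        # Other threads may add work at any time; the queue stays sorted.
        bisect.insort(self.pending, page)

    def next_page(self):
        if not self.pending:
            return None
        # Serve the first pending page beyond the head, else wrap to the start.
        i = bisect.bisect_right(self.pending, self.head)
        if i == len(self.pending):
            i = 0
        self.head = self.pending.pop(i)
        return self.head

elevator = Elevator()
for page in (50, 10, 30):
    elevator.submit(page)
print(elevator.next_page())  # serves the lowest page first
elevator.submit(20)          # work arriving ahead of the head joins this sweep
print(elevator.next_page())
```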

<p>There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the <a href="http://www.cwi.nl/" id="link-id0x1f62abb8">CWI collaborative scan paper</a>. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and <a class="auto-href" href="http://www.tpc.org/" id="link-id0x2a9067d0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x2a8874f0">H</a>.</p>

<p>While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.</p>

<p>The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x2997a660">CPU</a>. When running from memory, the CPU is around 1350% for the system in question. </p>

<p>This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read. </p>

<p>The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.</p>
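<p>The stopping rule can be written down as a small function. This is a sketch under the assumption that the meter yields one CPU reading per second; the function and parameter names are hypothetical and do not come from <code>ldmeter.sql</code>.</p>

```python
def seconds_to_target(cpu_samples, target=1200, window=20):
    """Return the second at which the first `window`-second stretch at or
    above `target` percent CPU completes, or None if it never happens.

    cpu_samples: one CPU-percent reading per second since cold start (assumed).
    """
    run = 0
    for second, cpu in enumerate(cpu_samples, start=1):
        run = run + 1 if cpu >= target else 0
        if run == window:
            return second
    return None

# 100 slow seconds, a 10-second burst that is too short, one dip, then a sustained stretch.
samples = [400] * 100 + [1250] * 10 + [900] + [1300] * 25
print(seconds_to_target(samples))
```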

<p>The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/ldmeter.sql" id="link-id0x1b5adb08">available here</a>. </p>


<p>The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store.</p>

<p>For the sake of simplicity we only run 7 Single with the 1000 Mt scale.</p>


<p>The first experiment was with SSDs and the vectored read-ahead.  The target throughput was reached after 280 seconds. </p>

<p>The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.</p>

<p>The result with HDDs and vectored read-ahead would be worse, since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.</p>

<p>There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 100 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.</p>
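<p>The "at least half" estimate follows from simple arithmetic, sketched below. It takes the 20 ms transfer time per full-extent read from the paragraph above and assumes that each transfer is followed by exactly one seek; the function name is made up for this illustration.</p>

```python
def fraction_of_nominal(transfer_ms, seek_ms):
    """Share of wall time the drive spends transferring data when every
    transfer is followed by one seek (simplifying assumption)."""
    return transfer_ms / (transfer_ms + seek_ms)

# Worst case from the text: the seek takes as long as the transfer.
print(fraction_of_nominal(20, 20))
# A shorter seek leaves more of the nominal throughput available.
print(fraction_of_nominal(20, 5))
```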

<p>We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so. </p>

<p>Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want.</p>

<p>So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see. </p>
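<p>Merging reads with a gap tolerance can be sketched as follows. The interpretation that two wanted pages are merged when they are at most 10 page numbers apart is an assumption, as is the function name; the actual read-ahead code may count the gap differently.</p>

```python
def merge_reads(pages, max_gap=10):
    """Group wanted page numbers into (first, last) runs; a page joins the
    current run when it is no more than `max_gap` pages past the run's end."""
    runs = []
    for page in sorted(pages):
        if not runs or page - runs[-1][1] > max_gap:
            runs.append([page, page])   # too far away: start a new request
        else:
            runs[-1][1] = page          # close enough: extend the request over the gap
    return [tuple(run) for run in runs]

# Pages 3, 4, and 9 become one 7-page request, of which 4 pages are read and discarded.
print(merge_reads([3, 4, 9, 30, 31, 45, 200]))
```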

<p>So we try, and we find that read-ahead does not account for most pages since it does not get triggered.  Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.</p>
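<p>The looser triggering condition can be expressed as a small piece of bookkeeping. This is only a sketch: the class name, the per-extent dictionary, and the choice to reset the record after a trigger are assumptions made for illustration.</p>

```python
PAGES_PER_EXTENT = 256  # a 2MB extent of 8K pages, as described in the text

class ExtentReadAheadTrigger:
    def __init__(self, window_s=20.0):
        self.window_s = window_s
        self.first_read = {}  # extent number -> time of the first recent read

    def on_read(self, page, now):
        """Return True when this read should schedule the whole extent."""
        extent = page // PAGES_PER_EXTENT
        first = self.first_read.get(extent)
        if first is not None and now - first > self.window_s:
            first = None  # the earlier read is too old to count
        if first is None:
            self.first_read[extent] = now
            return False
        del self.first_read[extent]  # triggered; start over for this extent
        return True

trigger = ExtentReadAheadTrigger()
print(trigger.on_read(10, now=0.0))  # first read in extent 0
print(trigger.on_read(20, now=5.0))  # second read within the window
```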

<p>The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.</p>
<p>Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more. </p>
<h3>BSBM Note </h3>

<p>We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum.</p>

<p>Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.</p>

<p>A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x29987c90">RDF</a> stores. In particular, reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.</p>

<p>Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li> <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1ecb2af0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x19d05678">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1d542328">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x13947e08">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1a7f6b30">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1d67dd40">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1ebcee68">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1a855ba0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1b081e70">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1d7a7940">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d7e2cd0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e375338">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d199728">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e808818">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-04#1665">
  <rss:title>Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-04T20:28:28Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning. Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e.g., web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention. The following three settings are all in the [Parameters] section of the virtuoso.ini file. AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better. ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better. IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial. A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact. In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. 
Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available. Note: The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count. File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question. It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples. In the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files. There are no SSD specific settings. Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes? Use one stream per core (not per core thread). 
In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed. Use the built-in bulk load facility, i.e., ld_dir (&#39;&lt;source-filename-or-directory&gt;&#39;, &#39;&lt;file name pattern&gt;&#39;, &#39;&lt;destination graph iri&gt;&#39;); For example, SQL&gt; ld_dir (&#39;/path/to/files&#39;, &#39;*.n3&#39;, &#39;http://dbpedia.org&#39;); Then do a rdf_loader_run () on enough connections. For example, you can use the shell command isql rdf_loader_run () &amp; to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details). See the BSBM disclosure report for an example load script. What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint. Execute CHECKPOINT; through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol. What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load. No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. 
Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with DefaultIsolation = 4 Transaction isolation cannot be set over the SPARQL protocol. NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update. What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured. In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If &quot;swappiness&quot; on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting. What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache? In an isql session, execute STATUS ( ? ? ); The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format. What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. 
If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index. Execute on an isql session: CHECKPOINT; SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC; The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for text index. If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph. The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute STATISTICS DB.DBA.RDF_QUAD; For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index? The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order. For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect? In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful. 
To change this on the fly, the following can be issued through an isql session: cl_exec ( &#39; __dbf_set (&#39;&#39;cl_request_batch_size&#39;&#39;, 50000) &#39; ); The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields. STATUS (&#39;cluster&#39;) ;; whole cluster STATUS (&#39;cluster_d&#39;) ;; process-by-process Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings? For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with StopCompilerWhenXOverRun = 1 When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with DefaultIsolation = 2 Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire (this post) Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? 
Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Below is a questionnaire I sent to the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa2f6798">BSBM</a> participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x195c2070">Virtuoso</a>, here. This can be a checklist for pretty much any <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e1c9bb0">RDF</a> database tuning.</p>


<ol>
<li>
<p>
    <b>Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e.g., web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.</b>
  </p>

<p>The following three settings are all <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1ed4fe10">in the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>. </p>

<ul>
<li>
      <p>
     <b><code>AsyncQueueMaxThreads</code>
     </b> controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either <b>1.5 * the number of cores</b> or <b>1.5 * the number of core threads</b>; see which works better.</p>
    </li>

<li>
      <p>
     <b><code>ThreadsPerQuery</code>
     </b> is the maximum number of threads a single query will take. This should be set to either <b>the number of cores</b> or <b>the number of core threads</b>; see which works better. </p>
    </li>

<li>
      <p>
     <b><code>IndexTreeMaps</code>
     </b> is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (<b>256</b> in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to <b>64, 128, or 512</b> may be beneficial.</p>

<p>A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1262b4b0">cache</a> artifact.</p>
    </li>
</ul>

<p></p>
  <p>
    <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_HTTPServer" id="link-id0x1f8960a0">In the <code>[HTTPServer]</code> section of the <code>virtuoso.ini</code> file</a>, the <b><code>ServerThreads</code></b> setting is the number of web server threads, i.e., the maximum number of concurrent <a class="auto-href" href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x17c1bef0">SPARQL protocol</a> requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.</p>
<p>Note: The <code>[HTTPServer] ServerThreads</code> are taken from the total pool made available by the <code>[Parameters] ServerThreads</code>. Thus, the <code>[Parameters] ServerThreads</code> should always be at least as large as (and is best set greater than) the <code>[HTTPServer] ServerThreads</code>, and if using the closed-source Commercial Version, <code>[Parameters] ServerThreads</code> cannot exceed the licensed thread count. </p>
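<p>To make the rules of thumb above concrete, the following sketch prints a starting <code>virtuoso.ini</code> fragment for a given core count. The setting names are the ones discussed above; the helper name and the headroom of 10 extra threads in <code>[Parameters] ServerThreads</code> are assumptions for illustration, and the values are starting points to be checked by measurement.</p>

```python
def threading_ini(cores, http_threads, extra_threads=10):
    """Build a virtuoso.ini fragment from the rules of thumb in the text.

    extra_threads is a hypothetical headroom so that [Parameters] ServerThreads
    stays greater than [HTTPServer] ServerThreads.
    """
    return "\n".join([
        "[Parameters]",
        f"ServerThreads = {http_threads + extra_threads}",
        f"AsyncQueueMaxThreads = {int(1.5 * cores)}",
        f"ThreadsPerQuery = {cores}",
        "IndexTreeMaps = 256",
        "",
        "[HTTPServer]",
        f"ServerThreads = {http_threads}",
    ])

print(threading_ini(cores=8, http_threads=16))
```

<p>Passing the number of core threads instead of the number of cores gives the second variant mentioned above; see which works better.</p>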
</li>


<li>
<p>
    <b>File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.</b>
  </p>

<p>It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the <code>segment</code> declaration in the <code>virtuoso.ini</code> file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1e0deb38">TPC</a>-C sample for examples. </p>

<p> <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1f893f48">In the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, set <code>FDsPerFile</code> to be <code> (the number of concurrent threads * 1.5) ÷ the number of distinct database files</code>.</p>

<p>There are no SSD specific settings.</p>
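<p>The <code>FDsPerFile</code> formula is easy to evaluate by hand, but a small sketch shows the rounding. Rounding up to a whole number is an assumption, since the formula above does not say how to treat fractions, and the function name is made up.</p>

```python
import math

def fds_per_file(concurrent_threads, database_files):
    """(concurrent threads * 1.5) divided by the number of database files,
    rounded up to a whole descriptor count (rounding is an assumption)."""
    return math.ceil(concurrent_threads * 1.5 / database_files)

# 16 concurrent threads over 4 stripe files.
print(fds_per_file(16, 4))
```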
</li>


<li>
<p>
    <b>Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes? </b>
  </p>

<p>Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.</p>

<p>Use the built-in bulk load facility, i.e., </p>
<blockquote>
    <code>ld_dir (&#39;&lt;source-filename-or-directory&gt;&#39;, &#39;&lt;file name pattern&gt;&#39;, &#39;&lt;destination graph iri&gt;&#39;);</code>
  </blockquote>
<p>For example,</p>
<blockquote>
    <code>SQL&gt; ld_dir (&#39;/path/to/files&#39;, &#39;*.n3&#39;, &#39;http://dbpedia.org&#39;);</code>
  </blockquote>
<p>Then do a <code>rdf_loader_run ()</code> on enough connections. For example, you can use the shell command </p>
<blockquote>
    <code>isql &lt;port&gt; &lt;user&gt; &lt;password&gt; exec=&quot;rdf_loader_run ();&quot; &amp;</code> </blockquote>
<p>to start one in a background isql process. When starting background load commands from the shell, you can use the shell <code>wait</code> command to wait for completion. If starting from isql, use the <code>wait_for_children;</code> command (see <a href="http://docs.openlinksw.com/virtuoso/isql.html" id="link-id0x1ae0f230">isql documentation</a> for details). </p>
<p>See the <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d635820">BSBM disclosure report</a> for an example load script.</p>
</li>


<li>
<p>
    <b>What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c522378">CPU</a>-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.</b>
  </p>

<p>Execute </p>
<blockquote>
    <code> CHECKPOINT;</code>
  </blockquote> 
<p>through a SQL client, e.g., <code>isql</code>. This is not a <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1c1e95b0">SPARQL</a> statement and cannot be executed over the SPARQL protocol.</p>
</li>


<li>
<p>
    <b>What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.</b>
  </p>

<p>No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is <b><code>REPEATABLE READ</code></b>, but this may be altered via SQL session settings or at Virtuoso server start-up through <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1a791b80">the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, with</p>
<blockquote>
   <b><code><a href="http://wikis.openlinksw.com/dataspace/owiki/wiki/VirtuosoWikiWeb/ChangeVirtuosoSDefaultTransactionIsolationLevel" id="link-id0x1e5536b8">DefaultIsolation</a> = 4</code>
   </b>
  </blockquote>
<p> Transaction isolation cannot be set over the SPARQL protocol.</p>
<p> NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x1c592f70">ACID</a> considerations.  See answer #12, below, and detailed discussion in part 8 of this series, <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b7eb5f0">BSBM <i>Explore and Update</i></a>.</p>
</li>


<li>
<p>
    <b>What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured. </b>
  </p>

<p>
    <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1acd8fe8">In the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, <b><code>NumberOfBuffers</code></b> controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If &quot;swappiness&quot; on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.</p>
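<p>As a rough illustration of the sizing rule above, the following sketch turns an amount of physical RAM into a starting value for <code>NumberOfBuffers</code>. The function name and the 16 GB machine are assumptions made for the example only; the 10KB-per-buffer and two-thirds figures are the rules of thumb from the text, and the result is only a starting point to be lowered if swapping occurs.</p>

```python
# Rule of thumb from the text: one buffer caches an 8 KB page but costs
# roughly 10 KB of RAM, and about two-thirds of physical memory can be
# given to database buffers when swappiness is low.
KB = 1024
GB = 1024 ** 3

def suggested_number_of_buffers(physical_ram_bytes, bytes_per_buffer=10 * KB):
    """Return a starting value for NumberOfBuffers (an estimate only)."""
    budget = physical_ram_bytes * 2 // 3   # two-thirds of RAM, integer math
    return budget // bytes_per_buffer

# Hypothetical 16 GB machine, chosen only for illustration.
buffers = suggested_number_of_buffers(16 * GB)
print("NumberOfBuffers =", buffers)
```

<p>The value printed would go into the <code>[Parameters]</code> section; it is an upper estimate, not a measured optimum.</p>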
</li>


<li>
<p>
    <b>What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache? </b>
  </p>

<p>In an <code>isql</code> session, execute </p>
<blockquote>
    <code>STATUS ( &#39; &#39; );</code>
  </blockquote> 
<p>The second result paragraph gives counts of total, used, and dirty buffers. If the used buffer count is steady and below the total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk-based format.</p>
</li>


<li>
<p>
    <b>What command gives <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x11bf3008">information</a> on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.</b>
  </p>


<p>Execute on an <code>isql</code> session: </p>

<blockquote>
   <code><pre>
CHECKPOINT;
SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
</pre>
   </code>
  </blockquote>

<p>The <code>iss_pages</code> column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to <code>RDF_QUAD</code> are for quads; <code>RDF_IRI</code>, <code>RDF_PREFIX</code>, <code>RO_START</code>, <code>RDF_OBJ</code> are for dictionaries; <code>RDF_OBJ_RO_FLAGS_WORDS</code> and <code>VTLOG_DB_DBA_RDF_OBJ</code> are for text index. </p>
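<p>To get the three numbers asked for (quads, dictionaries, text index), the page counts can be summed per category and multiplied by the 8KB page size. The sketch below assumes the rows have already been fetched as (index name, <code>iss_pages</code>) pairs and that matching on the name prefix is good enough; the index names and page counts in <code>sample_rows</code> are made up for illustration and are not measured results.</p>

```python
PAGE_BYTES = 8 * 1024  # pages are 8 KB, as stated above

# Name prefixes per category, following the grouping in the text.
# The text-index prefixes are checked first because
# RDF_OBJ_RO_FLAGS_WORDS also starts with RDF_OBJ.
CATEGORIES = [
    ("text index", ("RDF_OBJ_RO_FLAGS_WORDS", "VTLOG_DB_DBA_RDF_OBJ")),
    ("quads", ("RDF_QUAD",)),
    ("dictionaries", ("RDF_IRI", "RDF_PREFIX", "RO_START", "RDF_OBJ")),
]

def categorize(index_name):
    """Map an index name to one of the categories, or 'other'."""
    for category, prefixes in CATEGORIES:
        if index_name.startswith(prefixes):
            return category
    return "other"

def space_by_category(rows):
    """rows: iterable of (index_name, iss_pages); returns bytes per category."""
    totals = {}
    for index_name, pages in rows:
        category = categorize(index_name)
        totals[category] = totals.get(category, 0) + pages * PAGE_BYTES
    return totals

# Hypothetical names and page counts, for illustration only.
sample_rows = [
    ("RDF_QUAD", 1000),
    ("RDF_QUAD_POGS", 800),
    ("RDF_IRI", 200),
    ("RDF_OBJ", 150),
    ("RDF_OBJ_RO_FLAGS_WORDS", 50),
]
for category, size in sorted(space_by_category(sample_rows).items()):
    print(category, size, "bytes")
```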


</li>
<li>
<p>
    <b>If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x17eb98f8">data</a> will be in a single big graph.</b>
  </p>

<p>The default scheme uses quads. The default index layout is <code>PSOG</code>, <code>POGS</code>, <code>GS</code>, <code>SP</code>, <code>OP</code>. To see the current index scheme, use an <code>isql</code> session to execute</p>
<blockquote>
    <code>STATISTICS DB.DBA.RDF_QUAD;</code>
  </blockquote>


</li>
<li>
<p>
    <b>For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by <code>S</code> or <code>O</code> depending on which is first in key order for each index? </b>
  </p>

<p>The default partitioning settings are good, i.e., partitioning is on <code>O</code> or <code>S</code>, whichever is first in key order.</p>


</li>
<li>
<p>
    <b>For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?</b>
  </p>

<p> <a href="http://docs.openlinksw.com/virtuoso/clusteroperation.html#clusteroperationgeneralclusterinifields" id="link-id0x1ec6dff0">In the <code>[Cluster]</code> section of the <code>cluster.ini</code> file</a>, <b><code>ReqBatchSize</code></b> is the number of query states dispatched between cluster nodes per message round trip. This may be raised from the default of <code>10000</code> to <code>50000</code> or so if this is seen to be useful. </p>

<p>To change this on the fly, the following can be issued through an <code>isql</code> session:</p>
<blockquote>
<code>cl_exec ( &#39; __dbf_set (&#39;&#39;cl_request_batch_size&#39;&#39;, 50000) &#39; ); </code>
  </blockquote>

<p>The commands below may be executed through an <code>isql</code> session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation <a href="http://docs.openlinksw.com/virtuoso/clusteroperation.html#clusteroperationadminstdispl" id="link-id0x1dfccec0">details the fields</a>. </p>
<blockquote>
   <pre> <code>STATUS (&#39;cluster&#39;)      ;; whole cluster</code> <br /> <code>STATUS (&#39;cluster_d&#39;)    ;; process-by-process</code>
   </pre></blockquote>

</li>
<li>
<p>
    <b>Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM <i>Explore</i> mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?</b>
  </p>

<ul>
<li>
      <p>For BSBM, needless query <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x11be47b8">optimization</a> should be capped at Virtuoso server start-up through the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file, with</p>
<blockquote>
     <b><code>StopCompilerWhenXOverRunTime = 1</code>
     </b>
      </blockquote> </li>
<li>
      <p>When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of <code>READ COMMITTED</code>, to remove most lock contention.  Transaction isolation cannot be adjusted via SPARQL.  This can be changed through SQL session settings, or at Virtuoso server start-up <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1f3a43c8">through the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, with</p>
<blockquote>
     <b><code><a href="http://wikis.openlinksw.com/dataspace/owiki/wiki/VirtuosoWikiWeb/ChangeVirtuosoSDefaultTransactionIsolationLevel" id="link-id0x1a5a51e0">DefaultIsolation</a> = 2</code>
     </b>
      </blockquote>
</li>
</ul>
</li>
</ol>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d6e5428">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1c3ea770">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1efeca30">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1bda5158">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1ec74808">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1ea253a0">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b02d528">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1ae81fc0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x197515c0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1a78db90">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d32ae10">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e8fcc18">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ae95050">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1dbf3158">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-02#1663">
  <rss:title>Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-02T23:23:16Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post I will summarize the figures for BSBM Load and Explore mixes at 100 Mt, 200 Mt, and 1000 Mt. (1 Mt = 1 Megatriple, or one million triples.) The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs. The exact specifications and configurations are in the raw reports to follow. The load time in the recent Berlin report was measured with the wrong function, and so far as we can tell, without multiple threads. The intermediate cut of Virtuoso they tested also had broken SPARQL/Update (also known as SPARUL) features. We have fixed this since, and give here the right numbers. In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso: 6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference. 6 Cluster is the generally available commercial only cluster-capable Virtuoso. 7 Single is the next generation single server Virtuoso, about to be released as a preview. To understand the numbers, we must explain how these differ from each other in execution: 6 Single has one thread-per-query, and operates on one state of the query at a time. 6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together. 7 Single has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that index lookups get large numbers of parameters which then are sorted to get an ascending search pattern which benefits from locality, so the n * log(n) index access for the batch becomes more like linear if the data accessed has any locality. 
Furthermore, if there are many operands to an operator, these can be split on multiple threads. Also, scans of consecutive rows can be split before the scan on multiple threads, each doing a range of the scan. These features are called vectored execution and query parallelization. These techniques will also be applied to the cluster variant in due time. The version 6 and 7 variants discussed here use the same physical storage layout with row-wise key compression. Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space. This column store option is not used here because it still has some problems with random order inserts. We will first consider loading. Below are the load times and rates for 7 at each scale. 7 Single Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 261,366 301 82 200 Mt 216,000 802 123 1000 Mt 130,378 6641 1012 In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale. We also loaded the smallest data set with 6 Single using the same load script. 6 Single Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 74,713 1192 145 CPU time with 6 Single was 8047 seconds. We compare this to 4453 seconds of CPU for the same load on 7 Single. The CPU% during the run was on either side of 700% for 6 Single and 1300% for 7 Single. Note that high percentages involve core threads, not real cores. The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. The 6 Single inserts transactionally but makes very frequent commits and writes no log, resulting in de facto non-transactional behavior but still there is a lock and commit cycle. Inserts in RDF load usually exhibit locality on all SPOG. 
Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go. Contention on page read-write locks is less because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row. Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed. In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful. Writes are all in-place, and no delta-merge mechanism is involved. For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block. Repeatable and serializable readers would block before an uncommitted insert. Now for the run (larger numbers indicate more queries executed, and are therefore better): 6 Single Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487 7 Single Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 11742 72278 200 Mt 10225 60951 1000 Mt 6262 24672 The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run. Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. 
The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%. This also explains why adding clients gives a larger boost at the smaller scale. Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of Single User Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH which is a bit worse than 6 Single. This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps. The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt. In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later. The full run details will be supplied at the end of this blog series. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore (this post) Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In this post I will summarize the figures for <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1edb1dd0">BSBM</a> Load and <i>Explore</i> mixes at 100 Mt, 200 Mt, and 1000 Mt.  (1 Mt = 1 Megatriple, or one million triples.)  The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs.  The exact specifications and configurations are in the raw reports to follow.</p>

<p>The load time in <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1f3716d8">the recent Berlin report</a> was measured with <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html#resultsExplore" id="link-id0x1dd37f80">the wrong function</a>, and so far as we can tell, without multiple threads. The intermediate cut of <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1c1c7798">Virtuoso</a> they tested also <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html#resultsExploreAndUpdate" id="link-id0x1e5fcf40"> had broken</a> <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1bfa40b8">SPARQL</a>/<a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1c1e1320">Update</a> (also known as <a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1ddc87d8">SPARUL</a>) features.  We have fixed this since, and give <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/results.zip" id="link-id0x1edf36b0">here the right numbers</a>.</p>

<p>In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> is the generally available single server configuration of Virtuoso.  Whether this is open source or not does not make a difference.</p>
 </li>
<li>
  <p>
    <i>6 Cluster</i> is the generally available, commercial-only, cluster-capable Virtuoso.</p>
</li>
<li>
  <p>
    <i>7 Single</i> is the next generation single server Virtuoso, about to be released as a preview.</p>
</li>
</ul>

<p>To understand the numbers, we must explain how these differ from each other in execution:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> has one thread-per-query, and operates on one state of the query at a time.</p>
 </li>

<li>
  <p>
    <i>6 Cluster</i> has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states.  Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.</p>
</li>

<li>
  <p>
    <i>7 Single</i> has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states.  This means, for example, that an index lookup receives a large number of parameters, which are sorted to give an ascending search pattern that benefits from locality; the <code>n * log(n)</code> index access for the batch then becomes closer to linear if the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1ea197c8">data</a> accessed has any locality. Furthermore, if an operator has many operands, these can be split across multiple threads.  Also, a scan of consecutive rows can be split in advance across multiple threads, each covering a range of the scan.  These features are called <i>vectored execution</i> and <i>query parallelization</i>.  These techniques will also be applied to the cluster variant in due time.</p>
</li>
</ul>
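<p>The idea behind sorting a batch of lookup parameters can be sketched with a toy example. This is not Virtuoso code: the sorted Python list stands in for an index, and the function names are invented for the illustration. The vectored variant sorts the batch and lets each search start where the previous one ended, which is the ascending access pattern described above.</p>

```python
import bisect

def lookup_one_at_a_time(index_keys, probes):
    """Each probe is an independent binary search over the whole index."""
    positions = []
    for key in probes:
        positions.append(bisect.bisect_left(index_keys, key))
    return positions

def lookup_vectored(index_keys, probes):
    """Sort the batch first, then search in ascending key order.

    Because the probes are visited in ascending order, each search can
    start at the position found for the previous probe, so probes with
    locality only look at a small neighbouring range of the index.
    """
    order = sorted(range(len(probes)), key=lambda i: probes[i])
    positions = [0] * len(probes)
    start = 0
    for i in order:
        start = bisect.bisect_left(index_keys, probes[i], start)
        positions[i] = start
    return positions

index_keys = list(range(0, 1000, 2))   # a sorted toy "index"
probes = [740, 12, 738, 14, 742, 10]   # a small batch with some locality
assert lookup_vectored(index_keys, probes) == lookup_one_at_a_time(index_keys, probes)
print(lookup_vectored(index_keys, probes))
```

<p>Both functions return the same positions in the original probe order; only the access pattern differs. In a real engine the gain comes from staying on the same index pages, which this list-based sketch can only hint at.</p>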

<p>The version 6 and 7 variants discussed here use the same physical storage layout with row-wise <a class="auto-href" href="http://dbpedia.org/resource/Data_compression" id="link-id0x1bd035c0">key compression</a>.  Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space.  This column store option is not used here because it still has some problems with random order inserts.</p>

<p> We will first consider loading.  Below are the load times and rates for 7 at each scale.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">7 Single</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 261,366 </td>
		<td align="center"> 301 </td>
		<td align="center"> 82 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 216,000 </td>
		<td align="center"> 802 </td>
		<td align="center"> 123 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 130,378 </td>
		<td align="center"> 6641 </td>
		<td align="center"> 1012 </td>
	</tr>
</table>

<p>In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the largest scale.</p>

<p>We also loaded the smallest data set with 6 Single using the same load script.</p>
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">6 Single</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 74,713 </td>
		<td align="center"> 1192 </td>
		<td align="center"> 145 </td>
	</tr>
</table>
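<p>When reading the two load tables, it helps to know which interval the rate column covers. The post does not say, but the figures look consistent with the rate being taken over load time plus checkpoint time; the short check below multiplies each rate by that combined time. This is an inference from the published numbers, not a statement of how the rates were actually computed.</p>

```python
# (scale, engine, rate in quads/s, load seconds, checkpoint seconds),
# copied from the two load tables.
LOADS = [
    ("100 Mt", "7 Single", 261366, 301, 82),
    ("200 Mt", "7 Single", 216000, 802, 123),
    ("1000 Mt", "7 Single", 130378, 6641, 1012),
    ("100 Mt", "6 Single", 74713, 1192, 145),
]

def implied_quads(rate, load_s, checkpoint_s):
    """Quads implied if the rate covers load plus checkpoint time."""
    return rate * (load_s + checkpoint_s)

for scale, engine, rate, load_s, checkpoint_s in LOADS:
    millions = implied_quads(rate, load_s, checkpoint_s) / 1e6
    print(f"{engine} {scale}: about {millions:.1f} million quads")
```

<p>Each product lands close to the nominal scale, whereas rate times load time alone falls well short of it.</p>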


<p>
<a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c0b96c0">CPU</a> time with 6 Single was 8047 seconds, compared with 4453 seconds of CPU for the same load on 7 Single.  The CPU% during the run hovered around 700% for 6 Single and 1300% for 7 Single.  Note that the high percentages count core threads, not real cores. </p>

<p>The difference is mostly attributable to vectoring and the introduction of a non-transactional insert.  6 Single inserts transactionally but makes very frequent commits and writes no log; the result is <i>de facto</i> non-transactional behavior, but there is still a lock and commit cycle.  Inserts in <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1ddef3e8">RDF</a> load usually exhibit locality on all of S, P, O, and G.  Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go.  Contention on page read-write locks is lower because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row.</p>

<p>Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed.  In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful.  Writes are all in-place, and no delta-merge mechanism is involved.  For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block.  Repeatable and serializable readers would block before an uncommitted insert.</p>



<p>Now for the run (larger numbers indicate more queries executed, and are therefore better):</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 6 Single Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7641 </td>
		<td align="center"> 29433 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 6017 </td>
		<td align="center"> 13335 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 1770 </td>
		<td align="center"> 2487 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 7 Single Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 11742 </td>
		<td align="center"> 72278 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 10225 </td>
		<td align="center"> 60951 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 6262 </td>
		<td align="center"> 24672 </td>
	</tr>
</table>

<p>The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state.  Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed.  For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.</p>

<p>Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%.  This also explains why adding clients gives a larger boost at the smaller scale. </p>
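<p>The ratios behind these observations can be read off the two throughput tables. The sketch below only restates the published QMpH figures; the function names are invented for the example.</p>

```python
# QMpH figures from the two throughput tables: (single user, 16 users).
QMPH = {
    "6 Single": {"100 Mt": (7641, 29433), "200 Mt": (6017, 13335), "1000 Mt": (1770, 2487)},
    "7 Single": {"100 Mt": (11742, 72278), "200 Mt": (10225, 60951), "1000 Mt": (6262, 24672)},
}

def engine_speedup(scale):
    """Single-user throughput of 7 Single relative to 6 Single."""
    return QMPH["7 Single"][scale][0] / QMPH["6 Single"][scale][0]

def client_scaling(engine, scale):
    """16-user throughput relative to single-user throughput."""
    single, sixteen = QMPH[engine][scale]
    return sixteen / single

for scale in ("100 Mt", "200 Mt", "1000 Mt"):
    print(scale,
          "| 7 vs 6, single user:", round(engine_speedup(scale), 2),
          "| 16 users vs 1, 6 Single:", round(client_scaling("6 Single", scale), 2),
          "| 16 users vs 1, 7 Single:", round(client_scaling("7 Single", scale), 2))
```

<p>On these figures the single-user advantage of 7 Single grows from roughly 1.5x at 100 Mt to roughly 3.5x at 1000 Mt, while the gain of 7 Single from going to 16 clients shrinks from roughly 6.2x to roughly 3.9x.</p>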

<p>Now let us look at the relative effects of parallelizing and vectoring in 7 Single.  We run 50 mixes of Single User <i>Explore</i>: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread.  Then we set the vector size to 1, meaning that the query pipeline runs one row at a time.  This gets us 1319 QMpH which is a bit worse than 6 Single.  This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.</p>
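<p>The two effects can be separated with simple ratios of the figures just quoted. The sketch assumes that the run with vector size 1 was still limited to a single thread, which the text implies but does not state outright.</p>

```python
# Single-user Explore throughput of 7 Single, in QMpH, as quoted above.
BOTH_ON = 6132          # parallelizing and vectoring enabled
SINGLE_THREAD = 2805    # vectoring on, execution limited to one thread
VECTOR_SIZE_ONE = 1319  # vector size 1 (assumed to be still single-threaded)

def ratio(faster, slower):
    """How many times faster the first figure is than the second."""
    return faster / slower

print("parallelization:", round(ratio(BOTH_ON, SINGLE_THREAD), 2))
print("vectoring:", round(ratio(SINGLE_THREAD, VECTOR_SIZE_ONE), 2))
print("combined:", round(ratio(BOTH_ON, VECTOR_SIZE_ONE), 2))
```

<p>For this mix each technique contributes a bit more than a factor of two, and the two multiply to give the combined gain.</p>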

<p>The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least.  The reason for the latter is covered in detail in <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1b9ed390">A Benchmarking Story</a>.  We note that while vectoring is primarily geared to better single-thread speed and better <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1ddc2f78">cache</a> hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.</p>

<p>In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring.  When moving to more complex workloads, the benefits become more pronounced.  For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring.  These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.</p>

<p>The full run details will be supplied at the end of this <a class="auto-href" href="http://dbpedia.org/resource/Blog" id="link-id0x1e9f69f0">blog</a> series.</p>

<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d0bb988">Benchmarks, Redux (part 1): On RDF Benchmarks </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x155fc700">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d96e218">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1d7a5170">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1def9ca0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1a7a7800">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1e9c6c68">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1e80c208">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dafd290">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1f34f7f8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1df24f50">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f4b19c8">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1de90cf8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ebefbe8">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2010-09-13#1626">
  <rss:title>VLDB Semdata Workshop - The New Frontier of Semdata </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-13T22:09:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This is a revised version of the talk I will be giving at the Semdata workshop at VLDB 2010. The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso. A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases. The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMS, point solutions done with map-reduce or the like, and lastly in some cases from key-value stores with relaxed schema but limited querying. The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability (you can run an RDF database on a cluster) but a question of relative cost as opposed to alternatives. The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption. I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso. 
But to realize the ultimate promise of RDF/Linked Data/Semdata, however we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores. Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest. The key-value store has some of the same appeal, as it is the DBMS laid bare, so to say, made understandable, without the again intractably-complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard code a point solution for the business question, if really must, maybe even in map-reduce. Besides, for some things that go beyond SQL (for example, with graph structures), there really isn&#39;t a good solution. Now, enter Vertica, Greenplum, VectorWise (a MonetDB project derivative from Ingres) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible. Here we find the next frontier of Semdata. Take Joe Hellerstein et al&#39;s work on declarative logic for the data centric data center. We have heard it many times: when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative query, well-defined semantics, parallel-database capability of the leading RDF stores. Merge this with locality compression and throughput from the best analytics DBMS. 
Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability. Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation. Such a hybrid database-fusion frontier is relevant because it addresses heterogeneous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with linked open data to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data. In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power. Last week I was at the LOD2 kick off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This is a revised version of the talk I will be giving at the <a href="http://semdata.org/events/2010/vldb" id="link-id0x1d137fe0">Semdata workshop</a> at <a href="http://www.vldb2010.org/" id="link-id0x24e5c4d0">VLDB 2010</a>.</p>

<p>
<a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtDirectionsChallengesSemdata" id="link-id0x1cff6678">The paper</a> shows how we store <a href="http://www.tpc.org/" id="link-id0x25037270">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x28c239f8">H</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x242a5378">data</a> as <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x25900d40">RDF</a> with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in <a href="http://virtuoso.openlinksw.com" id="link-id0x25904ab8">Virtuoso</a>.</p>

<p>A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.</p>

<p>The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x24c7c6b8">RDBMS</a>, point solutions done with map-reduce or the like, and, in some cases, key-value stores with relaxed <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x23ce37d8">schema</a> but limited querying.</p>

<p>The benefits of RDF are the ever-expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability (you can run an RDF database on a cluster) as a question of relative cost compared with the alternatives.</p>

<p>The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store and gains all the compression and locality advantages traditionally associated with column stores. Once memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.</p>

<p>I do not need to talk here about the benefits of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x25345270">linked data</a> and more or less <i>ad hoc</i> integration <i>per se</i>. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.</p>

<p>But to realize the ultimate promise of RDF/Linked Data/Semdata, whatever we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer running so much against the RDBMS as against map-reduce and key-value stores.</p>

<p>Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest.</p>

<p>The key-value store has some of the same appeal, as it is the DBMS laid bare, so to speak, made understandable, without the similarly intractable questions of fancy query planning and distributed <a href="http://dbpedia.org/resource/ACID" id="link-id0x25365e30">ACID</a> transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard-code a point solution for the business question if one really must, maybe even in map-reduce.</p>

<p>Besides, for some things that go beyond <a href="http://dbpedia.org/resource/SQL" id="link-id0x2533b280">SQL</a> (for example, with graph structures), there really isn&#39;t a good solution.</p>

<p>Now, enter <a href="http://www.vertica.com/" id="link-id0x253fe1d8">Vertica</a>, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x257102a0">Greenplum</a>, <a href="http://www.ingres.com/vectorwise/" id="link-id0x248ce120">VectorWise</a> (a <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x2596a8a8">MonetDB</a> project derivative from <a href="http://dbpedia.org/resource/Ingres" id="link-id0x2502a170">Ingres</a>) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible.</p>

<p>Here we find the next frontier of Semdata. Take <a href="http://dbpedia.org/resource/Joseph_M._Hellerstein" id="link-id0x24d3dca8">Joe Hellerstein</a> et al.&#39;s work on <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.html" id="link-id0x1c64ba98">declarative logic for the data-centric data center</a>.</p>

<p>We have heard it many times: when the data is big, the logic must go to it. We can take declarative, location-conscious rules, <i>à la</i> <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x257d2988">BOOM</a> and BLOOM, and combine these with the declarative query, well-defined semantics, and parallel-database capability of the leading RDF stores. Merge this with the locality, compression, and throughput of the best analytics DBMS.</p>

<p>Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.</p>

<p>Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, <i>à la</i> MonetDB, are applicable with minimal if any adaptation.</p>

<p>Such a hybrid database-fusion frontier is relevant because it addresses heterogeneous, large-scale data, with operations that are not easy to reduce to SQL, yet without losing the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x25455a40">linked open data</a>, to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.</p>

<p>In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.</p>

<p>Last week I was at the <a href="http://lod2.eu/" id="link-id0x2438c3e0">LOD2</a> <a href="http://lod2.eu/BlogPost/9-press-release-lod2-project-launch.html" id="link-id0x1aec1c10">kick off</a> and a <a href="http://www.larkc.eu/" id="link-id0x2836b780">LarKC</a> meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2010-04-05#1618">
  <rss:title>&quot;The Acquired, The Innate, and the Semantic&quot; or &quot;Teaching Sem Tech&quot;</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-04-05T15:21:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic. I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it. When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, &quot;working such magic that makes things do what they already want to do is easy.&quot; There is a grain of truth in that. In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples, for example my wife&#39;s grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such &quot;magic,&quot; needless to say, takes constant maintenance; else the spell breaks. To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. 
So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow. Now, in more specific terms, what can we realistically expect to teach about computer science? Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third. Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much. Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time. I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the &quot;semanticist&quot; mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry. Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. 
People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces. LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests. Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-&quot;paradigmatism&quot; given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles. I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it: The edge is created in the &quot;Wild West&quot;: there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism&#39;s sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be &quot;driven out o&#39;Dodge.&quot; So, if reality is like this, what attitude should the curriculum have towards it? 
Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after. But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty. Know when to ontologize, when to folksonomize. The history of standards has examples of &quot;stacks of Babel,&quot; sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc. Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base. The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat. Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. 
The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt. Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed. So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do? Data integration. Given heterogenous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go. Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks. Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein. Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. 
Some ability to work with these, short of making expert systems, will be desirable. Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc. Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf. Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest. The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define. We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error. If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. 
At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills. The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered. Related Linked Data Driven Data Virtualization for Web-scale Integration (presentation) Linked Data and Virtuoso in 2010 Getting The Linked Data Value Pyramid Layers Right Provenance and Reification in Virtuoso The Time for RDBMS Primacy Downgrade is Nigh! Aspects of RDF to RDF Mapping</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow-up to the meeting in Sofia I <a href="http://www.openlinksw.com/weblog/oerling/?id=1614" id="link-id0x19c4f938">blogged about earlier</a>. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.</p>

<p>I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech, and leave management to those who have time for it.</p>

<p>When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, &quot;working such magic that makes things do what they already want to do is easy.&quot; There is a grain of truth in that.</p>

<p>In order to build or manage organizations, we must work, as the wizard put it, <i>with</i> nature, not against it. There are also counter-examples: my wife&#39;s grandmother once decided to transform a regular willow into a weeping one by tying down the branches. Such &quot;magic,&quot; needless to say, takes constant maintenance; else the spell breaks.</p>

<p>To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think we have and to have it take root. It will if it will, and if it does not, it will take constant follow-up, like the would-be weeping willow.</p>

<p>Now, in more specific terms, what can we realistically expect to teach about computer science?</p>

<p>Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., <a href="http://dbpedia.org/resource/Cache" id="link-id0x15761008">cache</a>, memory, local network, disk, wide area network) is the second. Understanding the difference between synchronous and asynchronous operation and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.</p>

<p>Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.</p>

<p>Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.</p>

<p>I tried once to tell the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x5ca2ea8">SPARQL</a> committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the &quot;semanticist&quot; mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.</p>
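
<p>To make the point about interfaces concrete, here is a minimal sketch. It is not taken from Virtuoso or from any SPARQL protocol; the class <code>CountingStore</code> and its methods <code>lookup</code> and <code>lookup_many</code> are hypothetical names invented for illustration, and the call counter merely stands in for the cost of one client/server round trip. It contrasts an interface that exposes one instance per call with one that takes an array parameter.</p>

<pre>
```python
class CountingStore:
    """Toy key-value store that counts how often its interface is crossed.

    Hypothetical illustration only: each call stands in for one
    client/server round trip.
    """

    def __init__(self, rows):
        self._rows = dict(rows)
        self.calls = 0

    def lookup(self, key):
        """Single-instance interface: one crossing per key."""
        self.calls += 1
        return self._rows.get(key)

    def lookup_many(self, keys):
        """Array-parameter interface: one crossing for the whole batch."""
        self.calls += 1
        return [self._rows.get(k) for k in keys]


def fetch_one_by_one(store, keys):
    """Cross the interface once per key."""
    return [store.lookup(k) for k in keys]


def fetch_batched(store, keys):
    """Cross the interface once for all keys."""
    return store.lookup_many(keys)


if __name__ == "__main__":
    store = CountingStore({i: i * i for i in range(1000)})
    keys = list(range(100))

    fetch_one_by_one(store, keys)
    print("one by one:", store.calls, "crossings")

    store.calls = 0
    fetch_batched(store, keys)
    print("batched:", store.calls, "crossing")
```
</pre>

<p>Both functions return the same values; only the number of crossings differs. If each crossing costs a network round trip, that difference is what dominates the run time.</p>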

<p>Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.</p>

<p>
<a href="http://www.larkc.eu/" id="link-id0x7f63780">LarKC</a> (EU FP7 Large <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x4f50c00">Knowledge</a> Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.</p>

<p>Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-&quot;paradigmatism&quot; given. The geek mind likes to latch on to a paradigm (e.g., object orientation) and then tries to apply it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you pay lip service to the values of structure, <a href="http://dbpedia.org/resource/Information" id="link-id0x1f00c378">information</a> hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.</p>

<p>I was once at a <a href="http://dbpedia.org/resource/Data" id="link-id0x2021f598">data</a> integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:</p>

<blockquote>
<i>The edge is created in the &quot;Wild West&quot;: there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism&#39;s sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be &quot;driven out o&#39;Dodge.&quot;</i>
</blockquote>

<p>So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, it is at least not in the university but long before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.</p>

<p>But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.</p>

<ul>
 <li>
  <p>
    <b>Know when to ontologize, when to folksonomize.</b> The history of standards has examples of &quot;stacks of Babel,&quot; sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, <a href="http://dbpedia.org/resource/Tag" id="link-id0x6b5d288">tag</a> folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.</p>
 </li>

<li>
  <p>
    <b>Answer only questions that are actually asked.</b> This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.</p>

<p>The broader interpretation is to take on real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do, but real-world problems will be more complex and less neat.</p>
</li>

<li>
  <p>
    <b>Deal with ambiguity.</b> Data on which semantic technologies will be applied will be dirty, with errors ranging from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.</p>
</li>
</ul>
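
<p>As a minimal sketch of the literal reading of the second precept, the Python fragment below answers a single subclass question on demand by walking up a hierarchy, and contrasts this with eagerly materializing every implied pair. The toy hierarchy and the names <code>SUBCLASS_OF</code>, <code>is_subclass</code>, and <code>materialize_closure</code> are illustrative assumptions, not the API of any particular reasoner.</p>

<pre><code class="language-python"># Hypothetical toy hierarchy: each class mapped to its direct superclasses.
SUBCLASS_OF = {
    "GraduateStudent": ["Student"],
    "Student": ["Person"],
    "Professor": ["Person"],
    "Person": [],
}

def is_subclass(cls, ancestor, hierarchy=SUBCLASS_OF):
    """Answer one question on demand by walking up from cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(hierarchy.get(current, []))
    return False

def materialize_closure(hierarchy=SUBCLASS_OF):
    """Eagerly compute every (class, ancestor) pair, for contrast."""
    pairs = set()
    for cls in hierarchy:
        for other in hierarchy:
            if cls != other and is_subclass(cls, other, hierarchy):
                pairs.add((cls, other))
    return pairs
</code></pre>

<p>With the toy data, the on-demand call visits only the classes on the path upward from the one asked about, while the eager version computes all four (class, ancestor) pairs whether or not anyone asks for them.</p>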

<p>Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.</p>

<p>So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?</p>

<ul>
 <li>
  <p>
    <b>Data integration.</b> Given heterogeneous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1c5f5398">semantic web</a> community simply has to go.</p>
 </li>

<li>
  <p>
    <b>Design and implement workflows for content extraction,</b> e.g., <a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x16443fe0">NLP</a> or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.</p>
</li>

<li>
  <p>
    <b>Design SOA workflows.</b> The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.</p>
</li>

<li>
  <p>
    <b>Lightweight knowledge engineering.</b> The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seems all but inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.</p>
</li>

<li>
  <p>
    <b>Understand information quality</b> in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.</p>
</li>
</ul>
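
<p>To make the data integration item in the list above a little more concrete, here is a small Python sketch of mapping rows from two differently shaped relational sources onto one shared vocabulary. The source names, column names, and the <code>http://example.org/</code> namespace are hypothetical; only the FOAF property URIs are real. Whether such a mapping runs as ETL or on the fly is the judgment call described above.</p>

<pre><code class="language-python">FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "http://example.org/person/"  # hypothetical namespace

# Hypothetical per-source mappings from column names to shared properties.
MAPPINGS = {
    "hr": {
        "key": "emp_id",
        "columns": {"full_name": FOAF + "name", "email": FOAF + "mbox"},
    },
    "crm": {
        "key": "contact_no",
        "columns": {"display": FOAF + "name", "mail_addr": FOAF + "mbox"},
    },
}

def row_to_triples(source, row):
    """Map one relational row (a dict) to (subject, predicate, object) triples."""
    mapping = MAPPINGS[source]
    subject = BASE + source + "/" + str(row[mapping["key"]])
    triples = []
    for column, predicate in mapping["columns"].items():
        value = row.get(column)
        if value is not None:  # skip NULL columns rather than emit empty facts
            triples.append((subject, predicate, value))
    return triples
</code></pre>

<p>Both sources end up described with the same predicates, which is what lets a later query treat them as one body of data.</p>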

<p>Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.</p>

<p>Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.</p>

<p>The semanticists I have met are more of the scholar than the IT consultant profile. I say <i>semanticist</i> for the semantic web research people and <i>semantician</i> for the practitioner we are trying to define.</p>

<p>We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.</p>

<p>If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.</p>

<p>The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from theoretical computer science and towards hands-on database work, large systems performance, and the practicalities of getting data intensive projects delivered.</p>

<p>
<b>Related</b>
</p>
<ul>
 <li>
  <a href="http://virtuoso.openlinksw.com/presentations/Linked_Data_Virtualization/Linked_Data_Virtualization.html" id="link-id0x199aca78">Linked Data Driven Data Virtualization for Web-scale Integration (presentation)</a>
 </li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1603" id="link-id0x13297a70">Linked Data and Virtuoso in 2010</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1595" id="link-id0x1a3d0bd0">Getting The Linked Data Value Pyramid Layers Right</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1572" id="link-id0x1802b170">Provenance and Reification in Virtuoso</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id0x19af4220">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1375" id="link-id0x1a07a378">Aspects of RDF to RDF Mapping</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-08-14#1568">
  <rss:title>Updated hardware improves LUBM 8000 load rate in Virtuoso 6</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-08-14T19:01:30Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We repeated the earlier LUBM 8000 experiment on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same. The load rate is now 160,739 triples-per-second. Virtuoso 6 (previous run): 1 blade; 2 x Xeon 5410; 16G 667 MHz memory; reported load rate 110,532 triples-per-second. Virtuoso 6 (new run): 1 blade; 2 x Xeon 5520; 72G 1333 MHz memory; reported load rate 160,739 triples-per-second. Virtuoso 6 (newest run): 2 blades; 2 x Xeon 5520 + 2 x Xeon 5410 with 1x1GigE interconnect; 72G 1333 MHz + 16G 667 MHz memory respectively; reported load rate 214,188 triples-per-second. Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We repeated the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1562" id="link-id173d3068">earlier LUBM 8000 experiment</a> on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.</p>

<p>The load rate is now 160,739 triples-per-second on the new machine alone, and 214,188 triples-per-second with the two machines clustered.</p>

<table>
<tr>
<th></th>
<td>   </td>
<th align="center"><a href="http://virtuoso.openlinksw.com" id="link-id0x199b9740">Virtuoso</a> 6 <br /> (previous run)</th>
<td>   </td>
<th align="center">Virtuoso 6 <br /> (new run)</th>
<td>   </td>
<th align="center">Virtuoso 6 <br /> (newest run)</th>
</tr>
<tr>
<td align="left">blades</td>
<td>   </td>
<td align="center">1 </td>
<td>   </td>
<td align="center">1 </td>
<td>   </td>
<td align="center">2</td>
</tr>
<tr>
<td align="left">processors</td>
<td>   </td>
<td align="center">2 x Xeon 5410</td>
<td>   </td>
<td align="center">2 x Xeon 5520</td>
<td>   </td>
<td align="center"> 2 x Xeon 5520 <br />+ <br />2 x Xeon 5410 <br />with 1x1GigE <br />interconnect </td>
</tr>
<tr>
<td align="left">memory</td>
<td>   </td>
<td align="center"> 16G 667 MHz</td>
<td>   </td>
<td align="center">72G 1333 MHz</td>
<td>   </td>
<td align="center">72G 1333 MHz <br />+ <br /> 16G 667 MHz <br /> respectively</td>
</tr>
<tr>
<td align="left">reported load rate<br />triples-per-second</td>
<td>   </td>
<td align="center"> 110,532 </td>
<td>   </td>
<td align="center"> 160,739 </td>
<td>   </td>
<td align="center"> 214,188  </td>
</tr>
</table>
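
<p>As a quick reading aid for the table above, the relative gains between runs can be worked out with a few lines of Python. The rates are copied from the table; the dictionary keys and the <code>speedup</code> helper are just illustrative names.</p>

<pre><code class="language-python"># Load rates in triples-per-second, copied from the table above.
LOAD_RATES = {
    "previous run": 110532,
    "new run": 160739,
    "newest run": 214188,
}

def speedup(baseline, candidate, rates=LOAD_RATES):
    """Ratio of the candidate run's load rate to the baseline run's."""
    return rates[candidate] / rates[baseline]

for baseline, candidate in [("previous run", "new run"), ("new run", "newest run")]:
    print(f"{candidate} vs {baseline}: {speedup(baseline, candidate):.2f}x")
</code></pre>

<p>On these figures, the newer machine loads roughly 1.45 times faster than the previous one, and adding the older machine as a second cluster node gives roughly a further 1.33 times.</p>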

<p>Again, if others talk about loading LUBM, so must we.  Otherwise, this metric is rather uninteresting.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-03-25#1537">
  <rss:title>Beyond Applications - Introducing the Planetary Datasphere (Part 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-25T15:50:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware. Application Hosting For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0. Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon. What of Proprietary Data and its Security? Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes. What of Individual Privacy on the Open Web? The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. 
In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row. I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed. Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document. It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer. 
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user. The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway. Federation vs. Centralization On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this. A search engine can host a copy of the Document Web and allow text lookups on it. 
But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription. The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data. Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs. What is the Cost of Schema-Last? The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies. We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both. While anything can be represented, the masses are not that creative. 
For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course. Historically, our industry has been driven by two phenomena: New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security. Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database. To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad. Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air. DataSphere Precursors Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences. We have a totally ubiquitous network for the distribution of this, plus database technology to make this work. We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened. 
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data. Databases and Servers Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere. This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards. The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD&#39;s get a bit cheaper, one could have SSD for the database and disk for backup. With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over. For operating such systems, scaling-on-demand is necessary. 
Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self administrating since these things are quite complex and easily intractable if one does not have in depth understanding of this rather complex field. The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. Much cheaper than SAN&#39;s and Infiniband fabrics. Conclusions and Next Steps The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs. Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been. The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. 
These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic, they are hardcore database, concerning automatic provisioning and such matters. It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software. Related Beyond Applications - Introducing the Planetary Datasphere (Part 1) Serendipitous Discovery Quotient (SDQ) How Linked Data will change Advertising The Time for RDBMS Primacy Downgrade is Nigh! Data Spaces</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id155e3bd0">We have looked at the general implications of the DataSphere</a>, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.</p>

<h2>Application Hosting</h2>

<p>For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.</p>

<p>Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.</p>

<h2>What of Proprietary <a href="http://dbpedia.org/resource/Data" id="link-id0x13b5b4a0">Data</a> and its Security?</h2>

<p>Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x24db80e0">RDBMS</a> security, the best approach is to map the RDBMS to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x24ea7c40">RDF</a> on the fly, always running all requests through the RDBMS. This implicitly preserves any policy-based or label-based security schemes.</p>

<h2>What of Individual Privacy on the Open Web?</h2>

<p>The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with <code>https://</code> URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In <a href="http://dbpedia.org/resource/SQL" id="link-id0x24aaccc0">SQL</a>, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.</p>
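
<p>The SQL side of this can be sketched in a few lines of Python with the standard <code>sqlite3</code> module. This is a hand-rolled illustration of a policy condition being added behind the scenes, assuming a hypothetical <code>payroll</code> table and hypothetical helper names; it is not how any specific DBMS implements its policy feature.</p>

<pre><code class="language-python">import sqlite3

def open_demo_db():
    """Create an in-memory database with a hypothetical payroll table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payroll (employee TEXT, dept_id INTEGER, salary INTEGER)"
    )
    conn.executemany(
        "INSERT INTO payroll VALUES (?, ?, ?)",
        [("ann", 1, 100), ("bob", 1, 90), ("cat", 2, 120)],
    )
    return conn

def query_payroll(conn, user_dept_id, employee=None):
    """Run a payroll query with the department policy always applied.

    The caller may add its own criterion (an employee name); the policy
    condition on dept_id is added regardless of what the caller asked for.
    """
    conditions = ["dept_id = ?"]  # policy condition, not under caller control
    params = [user_dept_id]
    if employee is not None:
        conditions.append("employee = ?")  # the caller's own criterion
        params.append(employee)
    sql = (
        "SELECT employee, salary FROM payroll WHERE "
        + " AND ".join(conditions)
        + " ORDER BY employee"
    )
    return conn.execute(sql, params).fetchall()
</code></pre>

<p>A user in department 2 who asks for the record of an employee in department 1 simply gets no rows back, because the department comparison is part of every query that is run.</p>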

<p>I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.</p>

<p>Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data <a href="http://en.wikipedia.org/wiki/Data_Spaces" id="link-id0x2396c058">Space</a> can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, things are less likely to slip by, as happens when a policy rule misses some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.</p>

<p>It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer.</p>

<p>The difference between the Document Web and the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x238a0098">Linked Data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x23882280">Web</a> is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.</p>

<p>The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed <a href="http://dbpedia.org/resource/Information" id="link-id0x2388e458">information</a> using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.</p>
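
<p>A minimal Python sketch of graph-level filtering follows. It restricts the data to the graphs a user may read before any join is evaluated, so that intermediate results are covered as well as final ones. The quads, the grant table, and the function names are all hypothetical, and a real query processor would apply the restriction inside its query plan rather than by copying lists.</p>

<pre><code class="language-python"># Hypothetical quads: (graph, subject, predicate, object).
QUADS = [
    ("g:alice-public", "alice", "knows", "bob"),
    ("g:alice-private", "alice", "knows", "carol"),
    ("g:carol-public", "carol", "worksAt", "acme"),
    ("g:bob-public", "bob", "worksAt", "initech"),
]

# Hypothetical grants: the set of graphs each user may read.
GRANTS = {
    "dave": {"g:alice-public", "g:bob-public", "g:carol-public"},
    "alice": {"g:alice-public", "g:alice-private", "g:bob-public", "g:carol-public"},
}

def visible(user, quads=QUADS, grants=GRANTS):
    """Keep only quads from graphs the user may read, before any join runs."""
    allowed = grants.get(user, set())
    return [q for q in quads if q[0] in allowed]

def employers_of_friends(user, person):
    """Join knows with worksAt using only the quads visible to the user."""
    quads = visible(user)
    friends = {o for (_, s, p, o) in quads if s == person and p == "knows"}
    return sorted(o for (_, s, p, o) in quads if s in friends and p == "worksAt")
</code></pre>

<p>In this toy data, the employer of carol is public, but the fact that alice knows carol lives in a private graph. A query by dave therefore never reaches carol through alice, even though carol would only have been an intermediate result.</p>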

<h2>Federation vs. Centralization</h2>

<p>On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.</p>

<p>A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.</p>
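
<p>The free-allowance idea can be sketched as a simple per-user meter. Only the figure of one million free single-row joins comes from the text; the <code>JoinMeter</code> class, its method names, and the rule that a query is refused outright once the allowance is exceeded are assumptions made for illustration.</p>

```python
FREE_JOINS = 1_000_000  # free allowance from the text; everything else is assumed


class JoinMeter:
    """Counts single-row joins per user against a free allowance."""

    def __init__(self, free_joins=FREE_JOINS):
        self.free_joins = free_joins
        self.used = {}            # user -> joins consumed so far
        self.subscribers = set()  # users with a paid subscription

    def subscribe(self, user):
        self.subscribers.add(user)

    def charge(self, user, joins):
        """Record the work and return True if the query may proceed."""
        total = self.used.get(user, 0) + joins
        if total > self.free_joins and user not in self.subscribers:
            return False  # over the allowance and no subscription
        self.used[user] = total
        return True
```

<p>A real service would more likely meter inside the query engine and stop a query partway through, but the accounting would follow the same pattern.</p>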

<p>The cost of hosting a trillion triples will vary radically depending on what throughput is promised. This may result in pricing per service level, much as ISP pricing varies with promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five million joins per second, including infrequently accessed data.</p>

<p>Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs.</p>

<h2>What is the Cost of Schema-Last?</h2>

<p>The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.</p>

<p>We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task-specific representations will keep an edge. We ought to know, as we do both.</p>

<p>While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.</p>

<p>Historically, our industry has been driven by two phenomena:</p>

<ol>
<li>
  <b>New PCs every 2 years.</b> To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.</li>

<li>
  <b>Agility, or <i>ad hoc</i> over planned.</b> The reason the RDBMS won over <a href="http://dbpedia.org/resource/CODASYL" id="link-id0x13b23460">CODASYL</a> network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.</li>
</ol>

<p>To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.</p>

<p>Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air.</p>

<h2>DataSphere Precursors</h2>

<p>Right now, we have the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x236a9be8">Linked Open Data</a> movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x24db8350">DBpedia</a>, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.</p>

<p>We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.</p>

<p>We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.</p>

<p>Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.</p>

<h2>Databases and Servers</h2>

<p>Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data-intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on, but this we have explained elsewhere.</p>

<p>This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8 GB of RAM, a dual-core AMD x86-64 CPU, and 4 disks may cost about $700. A board with two 4-core Xeons, 16 GB, and 8 disks may be $4000, counting just the components. In our experience, about 32 GB per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low-end boards.</p>
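
<p>The arithmetic behind the estimate can be checked with a few lines. The constants are the figures quoted above; the function names are made up for the example, and the calculation deliberately ignores racks, networking, power, and any disks beyond those on each board.</p>

```python
import math

GB_PER_BILLION_TRIPLES = 32   # minimum quoted in the text
BOARD_RAM_GB = 8              # low-end board from the text
BOARD_COST_USD = 700          # component cost per board from the text


def boards_for_ram(ram_gb, board_ram_gb=BOARD_RAM_GB):
    """Number of boards needed to reach the requested total RAM."""
    return math.ceil(ram_gb / board_ram_gb)


def cluster_estimate(ram_gb):
    """Board count, component cost, and triple capacity for a RAM target."""
    boards = boards_for_ram(ram_gb)
    return {
        "boards": boards,
        "cost_usd": boards * BOARD_COST_USD,
        "billion_triples": ram_gb / GB_PER_BILLION_TRIPLES,
    }


print(cluster_estimate(1024))
```

<p>Taking 1 TB as 1024 GB, this gives 128 boards, $89,600 in components, and room for about 32 billion triples at the 32 GB minimum.</p>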

<p>The workload is all about large joins across partitions. The queries parallelize well, so using the largest and most expensive machines as building blocks is not cost-efficient. Having absolutely everything in RAM is also not cost-efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSD for the database and disk for backup.</p>

<p>With large data centers, redundancy becomes an issue. The most cost-effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.</p>

<p>For operating such systems, scaling on demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these matters are complex and quickly become intractable for anyone without an in-depth understanding of the field.</p>
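
<p>A minimal sketch of the placement idea follows: each partition is stored on distinct servers, and partitions marked as hot get an extra copy. The round-robin scheme, the function names, and the default copy counts are assumptions chosen for brevity; this is not a description of how any particular DBMS places replicas.</p>

```python
def place_replicas(partitions, servers, copies=2, hot=(), hot_copies=3):
    """Assign each partition to distinct servers, round-robin.

    Partitions listed in `hot` get `hot_copies` replicas instead of `copies`.
    """
    placement = {}
    start = 0
    for part in partitions:
        n = hot_copies if part in hot else copies
        if n > len(servers):
            raise ValueError("not enough servers for distinct replicas")
        # Consecutive servers starting at `start`, wrapping around the list.
        placement[part] = [servers[(start + i) % len(servers)] for i in range(n)]
        start += 1
    return placement


def survivors(placement, failed):
    """Partitions that still have at least one live replica."""
    return {p for p, hosts in placement.items() if set(hosts) - set(failed)}
```

<p>With four servers and two copies per partition, any single server failure leaves every partition reachable; losing two servers that hold both copies of the same partition does not, which is one argument for keeping a third copy of the data that matters most.</p>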

<p>The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.</p>

<h2>Conclusions and Next Steps</h2>

<p>The ingredients and use cases are there. If server clusters with 1 TB of RAM start under $100K, the cost of deployment is small compared to personnel costs.</p>

<p>Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, <a href="http://dbpedia.org/resource/Cyc" id="link-id0x2396a038">OpenCYC</a>, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.</p>

<p>The Linked Data Web has its origins in <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x13ea7110">Semantic Web</a> research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on the one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this <a href="http://dbpedia.org/resource/Blog" id="link-id0x2388e368">blog</a>, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are, however, not at all what is discussed in Semantic Web conferences. These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are not semantic; they are hardcore database matters, such as automatic provisioning.</p>

<p>It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: <i><b>go for it</b></i>. Stay tuned for more announcements, as we near production with our next generation of software.</p>


<h2>Related</h2>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id14e02bb0">Beyond Applications - Introducing the Planetary Datasphere (Part 1)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id117dc518">Serendipitous Discovery Quotient (SDQ)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1534" id="link-id15c52410">How Linked Data will change Advertising</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id11e93658">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace" id="link-id1491a588">Data Spaces</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-03-24#1535">
  <rss:title>Beyond Applications - Introducing the Planetary Datasphere (Part 1)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-24T14:38:57Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets. The term DataSphere comes from Dan Simmons&#39; Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these. This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet. I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person&#39;s or organization&#39;s contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it. General There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports. 
These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this. The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made. In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary. The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network: horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services. The place of the DBMS in the stack changes The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems. For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. 
It can&#39;t just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google&#39;s web page cache can&#39;t be relied on as a permanent repository of web content. Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse. End User Benefits For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user&#39;s web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user&#39;s Data Space is the collection of all these things, now presented in a queriable form. The user&#39;s Data Space is the user&#39;s statement of presence, referencing the diverse contributions of the user on diverse sites. The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user&#39;s interaction with the personal data space is through applications, like now. 
But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes. Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of &quot;meta-applications,&quot; i.e., unified interfaces for managing many related applications/tasks. Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall. Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of &quot;mashing up&quot; web service interfaces with procedural code, but of &quot;meshing&quot; data through declarative queries across application spaces. Applications in the DataSphere The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. 
Web 2.0 models fit right in. Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though. For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., DBpedia Vs. Freebase identifiers). The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data. The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying. Advertising The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. 
If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out. Content Providers, License, and Attribution For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else. Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI. One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis. Next we will look at the service provider and infrastructure side of this. Related Serendipitous Discovery Quotient (SDQ) How Linked Data will change Advertising The Time for RDBMS Primacy Downgrade is Nigh! Data Spaces</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This is the first in a short series of <a href="http://dbpedia.org/resource/Blog" id="link-id0x12c91d60">blog</a> posts about what becomes possible when essentially unlimited <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2375f488">linked data</a> can be deployed on the open web and private intranets.</p>

<p>The term <i>DataSphere</i> comes from Dan Simmons&#39; <i><a href="http://dbpedia.org/resource/Hyperion_Cantos" id="link-id12ad4718">Hyperion</a></i> science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the <a href="http://dbpedia.org/resource/Internet" id="link-id0x13084f08">net</a> today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.</p>

<p>This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.</p>

<p>I use the word <i>DataSphere</i> to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many <a href="http://dbpedia.org/resource/Data" id="link-id0x2504fff8">Data</a> Spaces, or Named Data Spaces. A <i>Data <a href="http://en.wikipedia.org/wiki/Data_Spaces" id="link-id0x81175fa0">Space</a></i> is essentially a person&#39;s or organization&#39;s contribution to the DataSphere. I use <i>Linked Data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x70f4e190">Web</a></i> to refer to component technologies and practices such as <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x3a5ddcd8">RDF</a>, <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x23b049e0">SPARQL</a>, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack <i>per se</i>, but this stack is still the best bet for it.</p>

<h2>General</h2>

<p>There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.</p>

<p>These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.</p>

<p>The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.</p>

<p>In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.</p>

<p>The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network: horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.</p>

<h2>The place of the DBMS in the stack changes</h2>

<p>The <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x10082590">RDBMS</a> in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.</p>

<p>For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can&#39;t just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google&#39;s web page cache can&#39;t be relied on as a permanent repository of web content.</p>

<p>Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language <a href="http://dbpedia.org/resource/Meme" id="link-id0x23c85e68">meme</a>, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.</p>

<h2>End User Benefits</h2>

<p>For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user&#39;s web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user&#39;s Data Space is the collection of all these things, now presented in a queriable form. The user&#39;s Data Space is the user&#39;s statement of presence, referencing the diverse contributions of the user on diverse sites.</p>

<p>The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user&#39;s interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.</p>

<p>Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of &quot;meta-applications,&quot; i.e., unified interfaces for managing many related applications/tasks.</p>

<p>Of course, there is the increase in semantic richness, such as better contextuality derived from <a href="http://dbpedia.org/resource/Entity" id="link-id0x23904698">entity</a> extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.</p>

<p>
<a href="http://myopenlink.net/dataspace/person/kidehen#this" id="link-id0x37342a60">Kingsley Idehen</a> has coined the term <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id0x3a56e4e8">SDQ</a>, for <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id0x23649b70">Serendipitous Discovery Quotient</a>, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of &quot;mashing up&quot; web service interfaces with procedural code, but of &quot;meshing&quot; data through declarative queries across application spaces.</p>
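
<p>To make the contrast between mashing and meshing concrete, here is a small sketch of a declarative-style join across two application spaces that happen to share identifiers. The triples are plain Python tuples, and the <code>mesh</code> function and the example predicates are assumptions for illustration; in practice this would be a SPARQL query spanning several graphs.</p>

```python
def mesh(space_a, space_b, predicate_a, predicate_b):
    """Join two sets of (subject, predicate, object) triples on shared subjects.

    Returns sorted (subject, object_from_a, object_from_b) rows.
    """
    left = {}
    for s, p, o in space_a:
        if p == predicate_a:
            left.setdefault(s, []).append(o)
    return sorted(
        (s, a, o)
        for s, p, o in space_b if p == predicate_b
        for a in left.get(s, [])
    )
```

<p>The join works only because both spaces name the same person with the same identifier, which is the point of sharing identifiers for real-world entities.</p>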

<h2>Applications in the DataSphere</h2>

<p>The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.</p>

<p>Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.</p>

<p>For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x2364eae8">DBpedia</a> vs. Freebase identifiers)? The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.</p>

<p>The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.</p>
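
<p>As a rough illustration of the kind of question meant here, the sketch below counts which graphs mention an identifier and picks the most widely used of several synonyms. It is only a toy under stated assumptions: the quad tuples, the function names <code>graphs_mentioning</code> and <code>preferred_term</code>, and the "most graphs wins" rule are all hypothetical and are not the entity ranking actually offered.</p>

```python
def graphs_mentioning(quads, identifier):
    """Return the set of graph names where the identifier occurs as subject or object.

    quads is an iterable of (subject, predicate, object, graph) tuples; this
    shape is an assumption made for the example.
    """
    return {g for (s, p, o, g) in quads if identifier in (s, o)}


def preferred_term(quads, synonyms):
    """Among synonymous identifiers, pick the one mentioned in the most graphs.

    Ties are broken alphabetically so that the result is deterministic.
    """
    return min(synonyms, key=lambda term: (-len(graphs_mentioning(quads, term)), term))
```

<p>A real service would answer this from indices rather than by scanning, but the question asked is the same: who uses this name, and where?</p>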

<h2>Advertising</h2>

<p>The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.</p>

<h2>Content Providers, License, and Attribution</h2>

<p>For the content provider, the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0xa9abc2f8">URI</a> is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.</p>

<p>Intellectual property considerations will make the URI a first-class citizen. Just as the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just as Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by any application expose the source as a dereferenceable URI.</p>

<p>One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.</p>

<p>Next we will look at the service provider and infrastructure side of this.</p>

<h2>Related</h2>
<ul>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id148ea4e0">Serendipitous Discovery Quotient (SDQ)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1534" id="link-id14b07f88">How Linked Data will change Advertising</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id117c6608">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace" id="link-id154e1d58">Data Spaces</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-01-02#1510">
  <rss:title>Linked Data &amp; The Year 2009 (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-01-02T16:17:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">As is fitting for the season, I will editorialize a bit about what has gone before and what is to come. Sir Tim said it at WWW08 in Beijing: linked data and the linked data web is the semantic web and the Web done right. The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general. The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft&#39;s Entity Frameworks technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service based business model and should interoperate and integrate with everything else: business needs first; schema last. Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is none the less important, on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel. It is against this backdrop that this year will play out. As concerns research, I will again quote Harry Halpin at ESWC 2008: &quot;Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&quot; [or words to this effect]. Something like the data web, and even the semantic web, will happen. Harry&#39;s question was whether this would be the descendant of what is today called semantic web research. I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though. Why should such a thing be RDF-based? 
If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF? The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can&#39;t do unless you own the RDBMS. Source access is not enough; you also need the knowledge. Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo. Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction. In hoc signo vinces. In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core. For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. 
As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords. For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect. For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side. For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. 
On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&#39;s Líon project. These will provide platforms for further demonstrating the &quot;web&quot; in data web, as in web-scale smart databasing. 2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one&#39;s own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time. Exciting 2009 to all.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.</p>

<p>
<a href="http://www.w3.org/People/Berners-Lee/card#i" id="link-id1119f250">Sir Tim</a> said it at WWW08 in <a href="http://www2008.org/" id="link-id0x14ab66b0">Beijing</a>: <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x115a4588">linked data</a> and the linked data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0xa5c678">web</a> is the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x7cbe5540">semantic web</a> and the Web done right.</p>

<p>The grail of <i>ad hoc</i> analytics on infinite <a href="http://dbpedia.org/resource/Data" id="link-id0xa4b25428">data</a> has lost none of its appeal.  We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.</p>

<p>The benefits of a data model more abstract than the relational model are increasingly appreciated outside data web circles as well. Microsoft&#39;s <a href="http://dbpedia.org/resource/Entity" id="link-id0x1c3c72b0">Entity</a> Framework technology is an example.  Agility has been a buzzword for a long time.  Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first; schema last.</p>

<p>Nor should we forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized.  <a href="http://dbpedia.org/resource/Information" id="link-id0xa0743bd8">Information</a>, as the asset it is, is no less important; quite the contrary.  But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.</p>

<p>It is against this backdrop that this year will play out.</p>

<p>As concerns research, I will <a href="http://www.openlinksw.com/weblog/oerling/?id=1374" id="link-id1151b128">again quote</a> <a href="http://www.ibiblio.org/hhalpin/#" id="link-id141cb740">Harry Halpin</a> at <a href="http://www.eswc2008.org/" id="link-id0x28f68040">ESWC 2008</a>: &quot;Men will fight in a war, and even lose a war, for what they believe just.  And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&quot; [or words to this effect].</p>

<p>Something like the data web, and even the semantic web, will happen. Harry&#39;s question was whether this would be the descendant of what is today called semantic web research.</p>

<p>I heard in conversation about a project for making a very large metadata store.  I also heard that the makers did not particularly insist on this being <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x13c8af68">RDF</a>-based, though.</p>

<p>Why should such a thing be RDF-based?  If it is already accepted that there will be <i>ad hoc</i> schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?</p>

<p>The justification of RDF is in reusing and linking to data and terminology out there.  Another justification is that by using an RDF store, one is spared a lot of work and the many compromises that attend building an <a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x1ca17b20">entity</a>-attribute-value (<a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x1c9d6050">EAV</a>, i.e., triple) store on a generic <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x557dff0">RDBMS</a>.  The sem-web world has been there, trust me.  We came out well because we put everything inside the RDBMS, at the lowest level, which you can&#39;t do unless you own the RDBMS.  Source access is not enough; you also need the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1470c748">knowledge</a>.</p>
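
<p>To make "viewing the data from all angles" concrete, here is a minimal in-memory sketch of a triple (EAV) store that keeps one index per access path, so that a lookup can start from the subject, the predicate, or the object. The class <code>TripleStore</code> and its methods are hypothetical names for illustration; this is not how Virtuoso lays out its storage.</p>

```python
from collections import defaultdict


class TripleStore:
    """Toy triple store with three covering indices (illustrative only)."""

    def __init__(self):
        # One nested index per access path, so any bound position can drive a lookup.
        self._spo = defaultdict(lambda: defaultdict(set))
        self._pos = defaultdict(lambda: defaultdict(set))
        self._osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        """Insert one triple into all three indices."""
        self._spo[s][p].add(o)
        self._pos[p][o].add(s)
        self._osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        if s is not None:
            # Subject bound: walk the subject-first index.
            for p2, objs in self._spo.get(s, {}).items():
                if p is not None and p2 != p:
                    continue
                for o2 in objs:
                    if o is None or o2 == o:
                        yield (s, p2, o2)
        elif p is not None:
            # Predicate bound: walk the predicate-first index.
            for o2, subs in self._pos.get(p, {}).items():
                if o is not None and o2 != o:
                    continue
                for s2 in subs:
                    yield (s2, p, o2)
        elif o is not None:
            # Only the object bound: walk the object-first index.
            for s2, preds in self._osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:
            # Nothing bound: full scan.
            for s2, by_p in self._spo.items():
                for p2, objs in by_p.items():
                    for o2 in objs:
                        yield (s2, p2, o2)
```

<p>For example, after adding a few <code>knows</code> triples, <code>store.match(o="bob")</code> finds everyone who claims to know Bob without scanning every subject. Doing the same on a generic RDBMS means choosing and maintaining these access paths by hand, which is where the compromises start.</p>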

<p>Technicalities aside, the question is one of proprietary vs. standards-based.  This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. <a href="http://www.zemanta.com/" id="link-id0x524bea0">Zemanta</a> and <a href="http://www.opencalais.com/" id="link-id0x46132d38">OpenCalais</a> serving <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x13624fb8">DBpedia</a> URIs are examples.  Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.</p>

<p>Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata).  As on the web, so on the enterprise <a href="http://dbpedia.org/resource/Intranet" id="link-id0xa1392eb8">intranet</a>.  In this lies the strength of RDF as opposed to proprietary flexible database schemes.  This is a qualitative distinction.</p>
<p align="center">
 <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData" id="link-id117178a8"><img src="http://www.openlinksw.com/images/logos/LoDLogo.gif" alt="Linking Open Data project logo" />
 </a>
<br />
 <a href="http://dbpedia.org/resource/In_hoc_signo_vinces" id="link-id115f47e8"><i>In hoc signo vinces.</i>
 </a>
</p>

<p>In this light, we welcome the <a href="http://semanticweb.org/wiki/VoiD" id="link-id0x12352cc0">voiD</a> (<a href="http://semanticweb.org/wiki/VoiD" id="link-id0x722c18">VOcabulary of Interlinked Data</a>), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.</p>

<p>For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace.  <a href="http://dbpedia.org/resource/XML" id="link-id0x2c21c00">XML</a> is for the transaction; RDF is for the discovery, query, and analytics.  As with databases in general, first there was the transaction; then there was the query.  Same here.  For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota.  For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the <a href="http://www.openlinksw.com/weblog/oerling/?id=1374" id="link-id110b8668">Virtuoso Anytime</a> feature.  With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage.  Of course, we do not forget advertising.  When data has explicit semantics, contextuality is better than with keywords.</p>

<p>For these visions to materialize on top of the linked data platform, linked data must join the world of data.  This means messaging that is geared towards the database public.  They know the problem, but the RDF proposition is still not well enough understood for it to connect.</p>

<p>For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping.  We are also bringing out new Microsoft Entity <a href="http://dbpedia.org/resource/ADO.NET_Entity_Framework" id="link-id0x723080">Framework</a> components.  This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.</p>

<p>For <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id0x11e1dfc0">OpenLink Software</a>, 2008 was about developing technology for scale, RDF as well as generic relational.  We did show a tiny preview with the <a href="http://challenge.semanticweb.org/" id="link-id0x722d08">Billion Triples Challenge</a> demo.  Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale.  We <a href="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?id=1489" id="link-id150c6090">started offering ready-to-go Virtuoso-hosted linked open data sets</a> on Amazon EC2 in December.  Now we continue doing this based on our next-generation server, as well as making Virtuoso 6 Cluster commercially available.  Technical specifics are amply discussed on this <a href="http://dbpedia.org/resource/Blog" id="link-id0x10fc1930">blog</a>.  There is still some new technology to be developed this year; first among these items are strong <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x7fd25590">SPARQL</a> federation and on-the-fly resizing of server clusters.  On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&#39;s <a href="https://lion.deri.ie/" id="link-id115c02f8">Líon project</a>.  These will provide platforms for further demonstrating the &quot;web&quot; in data web, as in web-scale smart databasing.</p>

<p>2009 will see change through scale.  The things that exist will start interconnecting and there will be emergent value.  Deployments will be larger and scale will be readily available through a services model or by installation at one&#39;s own facilities.  We may see the start of Search becoming Find, as <a href="http://myopenlink.net/dataspace/person/kidehen#this" id="link-id14e43050">Kingsley</a> says, meaning semantics of data guiding search.  Entity extraction will multiply data volumes and bring parts of the data web to real time.</p>

<p>Exciting 2009 to all.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-11#1494">
  <rss:title>Virtuoso Anytime:  No Query Is Too Complex (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-11T16:13:10Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL end points, should the linked data web gain serious mass and traffic. As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set. How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries? Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done. Here we are looking for the top 10 people whom people claim to know without being known in return, like this: SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/BarackObama 252 http://twitter.com/brianshaler 183 http://twitter.com/newmediajim 101 http://twitter.com/HenryRollins 95 http://twitter.com/wilw 81 http://twitter.com/stevegarfield 78 http://twitter.com/cote 66 mailto:adam.westerski@deri.org 66 mailto:michal.zaremba@deri.org 66 http://twitter.com/dsifry 65 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. 
Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/JasonCalacanis 496 http://twitter.com/Twitterrific 466 http://twitter.com/ev 442 http://twitter.com/BarackObama 356 http://twitter.com/laughingsquid 317 http://twitter.com/gruber 294 http://twitter.com/chrispirillo 259 http://twitter.com/ambermacarthur 224 http://twitter.com/t 219 http://twitter.com/johnedwards 188 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better. But the response time was the same. If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation. One can also use this as a basis for paid services. 
The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set. This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>A persistent argument against the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id1199d5f8">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id116f2730">web</a> has been the cost, scalability, and vulnerability of <a href="http://dbpedia.org/resource/SPARQL" id="link-id14e423c0">SPARQL</a> endpoints, should the linked data web gain serious mass and traffic.</p>

<p>As we are on the brink of hosting the whole <a href="http://dbpedia.org/resource/DBpedia" id="link-id1376a8b0">DBpedia</a> <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id113c8d20">Linked Open Data</a> cloud in <a href="http://virtuoso.openlinksw.com" id="link-id11425a78">Virtuoso</a> Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set.</p>

<p>How can we encourage clever use of <a href="http://dbpedia.org/resource/Data" id="link-id116f1210">data</a>, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?</p>

<p>Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute.  Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness.  So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.</p>

<p>Here we are looking for the top 10 people whom people claim to know without being known in return, like this:</p>

<blockquote>
<pre>SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/BarackObama             252
http://twitter.com/brianshaler             183
http://twitter.com/newmediajim             101
http://twitter.com/HenryRollins            95
http://twitter.com/wilw                    81
http://twitter.com/stevegarfield           78
http://twitter.com/cote                    66
mailto:adam.westerski@deri.org             66
mailto:michal.zaremba@deri.org             66
http://twitter.com/dsifry                  65<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:      1R rnd      0R seq      0P disk  1.346KB /      3 messages<br />
SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/JasonCalacanis          496
http://twitter.com/Twitterrific            466
http://twitter.com/ev                      442
http://twitter.com/BarackObama             356
http://twitter.com/laughingsquid           317
http://twitter.com/gruber                  294
http://twitter.com/chrispirillo            259
http://twitter.com/ambermacarthur          224
http://twitter.com/t                       219
http://twitter.com/johnedwards             188<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:    329R rnd   44.6KR seq    342P disk  638.4KB /     46 messages</pre></blockquote>

<p>The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better.  But the response time was the same.</p>

<p>If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple.  But such queries are not very interesting.  To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks.  The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the <code>GROUP BY</code> to the <code>ORDER BY</code>.  If this again times out, we continue with the next outer layer.  This guarantees that results are delivered if there were any results found for which the query pattern is true.  False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation.</p>
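
<p>The layered cutoff described above can be sketched roughly as follows. This is a simplified model and not Virtuoso's implementation: the function <code>anytime_top_celebs</code>, the <code>grace</code> parameter, and the fake <code>ticking_clock</code> are hypothetical, and the <code>foaf:knows</code> data is passed in as plain (claimant, celeb) pairs. The innermost layer (the scan with its existence filter) stops at the first timeout; the deadline is then extended so that whatever was accumulated still flows through the grouping and ordering.</p>

```python
import itertools
from collections import Counter


def ticking_clock():
    """Fake clock advancing one 'second' per call, to keep the example deterministic."""
    counter = itertools.count(1)
    return lambda: next(counter)


def anytime_top_celebs(pairs, clock, timeout, grace, limit=10):
    """Return (rows, complete) for 'known by others but not knowing them back'.

    pairs: list of (claimant, celeb) tuples standing in for foaf:knows triples.
    clock: zero-argument callable returning the current time.
    """
    index = set(pairs)  # stands in for the database index used by the existence test
    complete = True
    deadline = clock() + timeout

    # Innermost layer: scan plus the NOT EXISTS filter; stops at the first timeout.
    matched = []
    for claimant, celeb in pairs:
        if clock() >= deadline:
            complete = False
            break
        if (celeb, claimant) not in index:
            matched.append(celeb)

    # Next layer out: GROUP BY, with the deadline extended so accumulated rows get aggregated.
    deadline = clock() + grace
    counts = Counter()
    for celeb in matched:
        if clock() >= deadline:
            complete = False
            break
        counts[celeb] += 1

    # Outermost layer: ORDER BY count descending, then LIMIT; always runs so rows are returned.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
    return rows, complete
```

<p>Every row returned satisfies the query pattern, but when <code>complete</code> is false the counts may be lower than a full evaluation would give, which is the caveat noted above.</p>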

<p>One can also use this as a basis for paid services.  The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.</p>
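
<p>A cutoff in units other than time might look like the sketch below, which charges one unit per row touched. The names <code>WorkMeter</code> and <code>metered_scan</code> and the choice of "rows touched" as the unit are assumptions made for illustration only.</p>

```python
class WorkMeter:
    """Counts abstract work units (here: rows touched) instead of elapsed time."""

    def __init__(self, quota):
        self.quota = quota
        self.used = 0

    def charge(self, units=1):
        """Record work; return False once the quota would be exceeded."""
        if self.used + units > self.quota:
            return False
        self.used += units
        return True


def metered_scan(rows, meter):
    """Yield rows until the meter runs out.

    The same quota yields the same prefix of the input regardless of machine
    load or cache state, unlike a wall-clock timeout.
    """
    for row in rows:
        if not meter.charge():
            return
        yield row
```

<p>With such a meter, the amount of processing done is reproducible, and <code>meter.used</code> gives a natural figure to report or bill.</p>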

<p>This system will be deployed on our <a href="http://challenge.semanticweb.org/" id="link-id11500a58">Billion Triples Challenge</a> <a href="http://b3s.openlinksw.com/" id="link-id11683120">demo instance</a> in a few days, after some more testing.  When Virtuoso 6 ships, all <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id1157a500">LOD</a> Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default.  (AMI users will be able to disable the feature, if desired.)  The feature works with Virtuoso 6 in both single server and cluster deployment.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1476">
  <rss:title>ISWC 2008: RDB2RDF Face-to-Face</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T13:26:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The W3C&#39;s RDB-to-RDF mapping incubator group (RDB2RDF XG) met in Karlsruhe after ISWC 2008. The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time. Whose Identifiers? There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM&#39;s reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer. Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with &quot;same as&quot; assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems. XML or RDF? 
We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this. Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential. Interoperability We agreed with Sören Auer that we could make Virtuoso&#39;s mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults. There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. 
For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition. Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go. Related Web resources OpenLink Virtuoso: Open-Source Edition: Mapping SQL Data to RDF Virtuoso RDF Views - Getting Started Guide (PDF)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The W3C&#39;s RDB-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x141f0470">RDF</a> mapping incubator group (<a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x13b8d018">RDB2RDF XG</a>) met in <a href="http://dbpedia.org/resource/Karlsruhe" id="link-id0x1e748060">Karlsruhe</a> after <a href="http://iswc2008.semanticweb.org/" id="link-id0x1eba8468">ISWC 2008</a>.</p>

<p>The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1e5abe10">SPARQL</a> to <a href="http://dbpedia.org/resource/SQL" id="link-id0x13930368">SQL</a>. There was a lot of agreement and the meeting even finished ahead of the allotted time.</p>

<h2>Whose Identifiers?</h2>

<p>There was discussion concerning using the <a href="http://dbpedia.org/resource/Entity" id="link-id0x15587978">Entity</a> Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM&#39;s reference cannot depend on any registration. Also, even when there is a stable <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x144660f8">URI</a> for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.</p>

<p>Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with &quot;same as&quot; assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.</p>
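<p>To make the lookup point concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is only an illustration: the <code>customer</code> and <code>shared_uri</code> tables, the URI patterns, and the <code>customer_subjects</code> helper are all hypothetical, and the registered identifier is made up. The idea is that the shared URI is resolved by a local join, never by a network call, and that rows without a registered identifier fall back to a URI built from the local key.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical CRM table plus the extra lookup table the text describes.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE shared_uri (customer_id INTEGER PRIMARY KEY, uri TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
# Only customer 1 has a registered identifier (made-up URI).
conn.execute("INSERT INTO shared_uri VALUES (1, 'http://ens.example.org/entity/acme')")

LOCAL_PREFIX = "http://crm.example.com/customer/"  # assumed local URI pattern


def customer_subjects(conn):
    """Return {customer id: subject URI}, preferring the shared URI when one exists."""
    rows = conn.execute(
        "SELECT c.id, COALESCE(s.uri, ? || c.id) "
        "FROM customer c LEFT JOIN shared_uri s ON s.customer_id = c.id "
        "ORDER BY c.id",
        (LOCAL_PREFIX,),
    )
    return dict(rows)


subjects = customer_subjects(conn)
print(subjects)
```

<p>The lookup table is exactly the administrative burden mentioned above: somebody has to fill it and keep it current.</p>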

<h2>
<a href="http://dbpedia.org/resource/XML" id="link-id0x1edb8170">XML</a> or RDF?</h2>

<p>We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x2450fba8">OWL</a> ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has <a href="http://dbpedia.org/resource/XPath" id="link-id0x234e5478">XPath</a> in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the <a href="http://dbpedia.org/resource/Information" id="link-id0x22e129f8">information</a> schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.</p>

<p>Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of <a href="http://dbpedia.org/resource/Data" id="link-id0x155139c0">data</a>, such a description of the exposed data is essential.</p>

<h2>Interoperability</h2>

<p>We agreed with <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x132a64e0">Sören Auer</a> that we could make <a href="http://virtuoso.openlinksw.com" id="link-id0x1272c988">Virtuoso</a>&#39;s mapping language compatible with <a href="http://triplify.org/" id="link-id0x12622738">Triplify</a>. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.</p>

<p>There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.</p>
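<p>A minimal sketch of the two styles, in Python with an in-memory SQLite table. This is neither Virtuoso&#39;s nor Triplify&#39;s actual syntax; the table, the column names, the URI prefix, and both helper functions are hypothetical and serve only to show that the two formulations can yield the same triples.</p>

```python
import sqlite3

# Hypothetical sample data; not taken from the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Acme", "Karlsruhe"), (2, "Globex", "Leipzig")],
)

BASE = "http://example.com/"  # assumed URI prefix


def triple_first(conn):
    """Start from triple patterns and name the table and columns that fill them."""
    patterns = [
        # (table, key column, predicate, value column)
        ("customer", "id", BASE + "name", "name"),
        ("customer", "id", BASE + "city", "city"),
    ]
    triples = set()
    for table, key, predicate, column in patterns:
        query = "SELECT {k}, {c} FROM {t}".format(k=key, c=column, t=table)
        for key_value, value in conn.execute(query):
            triples.add((BASE + table + "/" + str(key_value), predicate, value))
    return triples


def sql_first(conn):
    """Start from one SQL statement; its first column is the subject key and
    every other column alias becomes a predicate."""
    cursor = conn.execute("SELECT id, name AS name, city AS city FROM customer")
    columns = [d[0] for d in cursor.description]
    triples = set()
    for row in cursor:
        subject = BASE + "customer/" + str(row[0])
        for alias, value in zip(columns[1:], row[1:]):
            triples.add((subject, BASE + alias, value))
    return triples


triples_a = triple_first(conn)
triples_b = sql_first(conn)
print(triples_a == triples_b)
print(len(triples_a))
```

<p>In the SQL-first form the subject pattern is one of the &quot;extra declarations&quot; that has to come from somewhere; here it is simply hard-coded as a default.</p>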

<p>Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.</p>

<h2>Related Web resources</h2>
<ul>
 <li>
  <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSSQL2RDF" id="link-id14e27040">OpenLink Virtuoso: Open-Source Edition: Mapping SQL Data to RDF</a>
 </li>
<li>
  <a href="http://virtuoso.openlinksw.com/Whitepapers/pdf/Virtuoso_SQL_to_RDF_Mapping.pdf" id="link-id1baad3a8">Virtuoso RDF Views - Getting Started Guide (PDF)</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448">
  <rss:title>Virtuoso Update, Billion Triples and Outlook</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T09:31:17Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will say a few things about what we have been doing and where we can go. Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post. There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe. Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work. We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. We did along the way also run BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL. RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol. In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web. I will now talk about the complex warehouse/web-harvesting side. 
I will come to the mapping in another post. Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time. Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later. The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility. We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. 
Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case. We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface. We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person. Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible. The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity. Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. 
A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable. Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience. The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports. Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will say a few things about what we have been doing and where we can go.</p>

<p>Firstly, we have a fairly scalable platform with <a href="http://virtuoso.openlinksw.com" id="link-id0xa412e450">Virtuoso</a> 6 Cluster. It was most recently tested with the workload discussed in the previous <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id1638a5b8">Billion Triples post</a>.</p>

<p>There is an updated version of <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16280a68">the paper about this</a>. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.</p>

<p>Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more <a href="http://dbpedia.org/resource/SQL" id="link-id0x1c1c5f48">SQL</a> optimizations specific to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1bcb6108">RDF</a>. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.</p>

<p>We spent a lot of time around the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x3a4e17c8">Berlin SPARQL Benchmark</a> story, so we got to the more advanced stuff like the <a href="http://challenge.semanticweb.org/" id="link-id0x1a66c568">Billion Triples Challenge</a> rather late. We did along the way also run <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x188c2608">BSBM</a> with an <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1aa97f98">Oracle</a> back-end, with Virtuoso mapping <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1abd87a0">SPARQL</a> to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.</p>

<p>RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1a941678">RDB2RDF XG</a>. Examples of complex warehouses include the <a href="http://neurocommons.org/page/Main_Page" id="link-id0x1aa5a9f8">Neurocommons</a> database, the Billion Triples Challenge, and the <a href="http://www.garlik.com/" id="link-id0x372df7b0">Garlik DataPatrol</a>.</p>

<p>In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1a671cf0">Linked Data</a> forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1ab83dd0">data</a> web becomes as indispensable as presence on the HTML web.</p>

<p>I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.</p>

<p>Now, all the things shown in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id14de1d18">Billion Triples post</a> can be done with a relational system specially built for each purpose. Since we are a general purpose <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x340d3470">RDBMS</a>, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.</p>

<p>Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.</p>

<p>The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.</p>

<p>We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of <a href="http://umbel.org/about/" id="link-id0x19cabf38">UMBEL</a> and <a href="http://dbpedia.org/resource/Cyc" id="link-id0x19cecd10">OpenCyc</a>. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.</p>

<p>We expect to be able to combine geography, social proximity, subject matter, and <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0x1a8202e8">named entities</a>, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.</p>

<p>We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.</p>

<p>Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.</p>

<p>The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of <a href="http://dbpedia.org/resource/Information" id="link-id0x1ab0a918">information</a> and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.</p>

<p>Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like <i>list the top 10 attributes with the most distinct values for all persons</i> cannot be done in SQL. SQL simply does not allow the columns to be variable.</p>
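<p>With triples the attribute name is itself a value, so the question becomes an ordinary group-by. A minimal sketch in Python over an in-memory list of triples follows; the sample data, the prefixed names, the <code>top_attributes</code> helper, and the decision to leave <code>rdf:type</code> out of the ranking are all assumptions made for illustration.</p>

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical triples; prefixed names stand in for full URIs.
triples = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:interest", "RDF"),
    ("ex:alice", "foaf:interest", "SQL"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:bob", "foaf:interest", "RDF"),
    ("ex:bob", "foaf:interest", "Benchmarks"),
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
    ("ex:acme", "foaf:name", "Acme"),
]


def top_attributes(triples, cls, limit=10):
    """Rank predicates by number of distinct values over subjects of class cls."""
    members = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    values = defaultdict(set)
    for s, p, o in triples:
        if s in members and p != RDF_TYPE:
            values[p].add(o)
    # Most distinct values first; ties broken by predicate name.
    ranked = sorted(values.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(p, len(objs)) for p, objs in ranked[:limit]]


ranking = top_attributes(triples, "foaf:Person")
print(ranking)
```

<p>Nothing in the function names a particular attribute, which is the point: the set of &quot;columns&quot; is discovered from the data.</p>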

<p>Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.</p>

<p>The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.</p>
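<p>As a rough sketch of what a parameter-fed canned report could look like on the client side, the Python below fills a hand-written query template. The <code>ex:</code> vocabulary, the query shape, and the helper names are hypothetical; this is plain text substitution with literal escaping, not a Virtuoso feature.</p>

```python
from string import Template

# Hypothetical canned report; the vocabulary and query shape are assumptions.
REPORT = Template(
    "PREFIX ex: <http://example.com/> "
    "SELECT ?post ?author WHERE { "
    "?post ex:newsgroup $group . "
    "?post ex:author ?author } "
    "LIMIT $limit"
)


def sparql_string(value):
    """Quote a Python string as a SPARQL string literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def run_report(group, limit):
    """Fill the parameter markers and return the query text to send."""
    return REPORT.substitute(group=sparql_string(group), limit=int(limit))


query = run_report('comp.databases "theory"', 10)
print(query)
```

<p>The user supplies only the news group and the row limit; the SPARQL itself stays out of sight.</p>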

<p>Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409">
  <rss:title>BSBM With Triples and Mapped Relational Data</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-06T19:35:27Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF. Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts. The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008. But why an OLTP workload with RDF to begin with? We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case. Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here. What We Got First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit. But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. 
And when there is a clear mapping, there is no reason the SPARQL could not be directly translated. If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor! We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in &quot;query mixes per hour&quot;. (An update or follow-on to this post will provide elapsed times for each test run.) With the unmodified benchmark we got: Physical Triples: 1297 qmph; Mapped Triples: 3144 qmph. In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.) The following were measured on the second run of a 100 query mix series, single test driver, warm cache. Physical Triples: 5746 qmph; Mapped Triples: 7525 qmph. We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time. Physical Triples: 19459 qmph; Mapped Triples: 24531 qmph. The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization. The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough. 
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so. Suggestions for BSBM Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC. Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload. Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let&#39;s use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot; For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC. Add transactions. Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server. Comments on Query Mix The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. 
Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales. Next We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The special contribution of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id10039db0">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id106b2538">BSBM</a>) to the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id101a75f8">RDF</a> world is to raise the question of doing OLTP with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xb230eb0">RDF</a>.</p>

<p>Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa832da8">BSBM</a> also specifies a relational schema and can generate the <a href="http://dbpedia.org/resource/Data" id="link-id1206c378">data</a> as either triples or <a href="http://dbpedia.org/resource/SQL" id="link-id1667f040">SQL</a> inserts.</p>

<p>The benchmark effectively simulates the case of exposing an existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id10a93518">RDBMS</a> as RDF.  <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id13e46d80">OpenLink Software</a> calls this <i>RDF Views</i>.  <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id12027578">Oracle</a> is beginning to call this <i>semantic covers</i>.  The <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id161dc678">RDB2RDF XG</a>, a W3C incubator group, has been active in this area since Spring, 2008.</p>

<h3>But why an OLTP workload with RDF to begin with?</h3>

<p>We believe this is relevant because RDF promises to be the interoperability factor between potentially all traditional information systems (IS).  If <a href="http://dbpedia.org/resource/Data" id="link-id0xabe48a0">data</a> is online for human consumption, it may be online via a <a href="http://dbpedia.org/resource/SPARQL" id="link-id106a8908">SPARQL</a> end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.</p>

<p>Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here.</p>

<h3>What We Got </h3>

<p>First, we found that <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400" id="link-id150ea748">making the query plan took much too long</a> in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.</p>

<p>But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is as good as unbeatable with the query mix.  And when there is a clear mapping, there is no reason the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x96bb5e0">SPARQL</a> could not be directly translated.</p>

<p>If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!</p>

<p>We filled two <a href="http://virtuoso.openlinksw.com" id="link-id12dbdc70">Virtuoso</a> instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &quot;query mixes per hour&quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)</p>

<p>With the unmodified benchmark we got:</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td>1297 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td><b>3144 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)</p>

<p>The following were measured on the second run of a 100 query mix series, single test driver, warm cache.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td> 5746 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td> <b>7525 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>We then ran the same test with 4 concurrent instances of the test driver. The qmph here is the total of 400 query mixes (4 drivers of 100 mixes each) divided by the longest run time, expressed in hours.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
    <td> 19459 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&nbsp; &nbsp;</td>
   <td> <b>24531 qmph</b>
   </td>
  </tr>
</table>
</blockquote>

<p>The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.  The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention.  The numbers do not evidence significant overhead from thread synchronization.</p>
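<p>As a quick check of the arithmetic, the following Python sketch recomputes the scaling factors from the throughputs in the tables above. The helper name <code>qmph</code> and the variable names are ours, chosen for illustration; only the four throughput figures come from the measurements.</p>
<pre>
```python
def qmph(query_mixes, longest_run_seconds):
    """Query mixes per hour: total mixes divided by the longest run time in hours."""
    return query_mixes / (longest_run_seconds / 3600.0)


# Throughputs quoted in the tables above (qmph, with the text index in place).
single = {"physical": 5746, "mapped": 7525}
four_drivers = {"physical": 19459, "mapped": 24531}

# Scaling from 1 to 4 concurrent test drivers, per storage scheme.
scaling = {name: four_drivers[name] / single[name] for name in single}

# Advantage of mapped over physical triples at each concurrency level.
mapped_advantage_single = single["mapped"] / single["physical"]
mapped_advantage_four = four_drivers["mapped"] / four_drivers["physical"]

for name, factor in scaling.items():
    print(f"{name}: {factor:.2f}x with 4 drivers")
print(f"mapped vs physical: {mapped_advantage_single:.2f}x (1 driver), "
      f"{mapped_advantage_four:.2f}x (4 drivers)")
```
</pre>
<p>With these inputs, the four-driver runs come out at roughly 3.4 times (physical) and 3.3 times (mapped) the single-driver throughput.</p>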

<p>The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher.  We used the <code>StopCompilerWhenXOverRunTime = 1</code> option here to cut needless compiler overhead, the queries being straightforward enough.</p>

<p>We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.</p>

<h3>Suggestions for BSBM</h3>

<ul>
 <li>
  <p>
    <b>Reporting Rules.</b> The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.</p>
 </li>

<li>
  <p>
    <b>Multiuser operation.</b>  The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.</p>
</li>

<li>
  <p>
    <b>Add business intelligence.</b>  SPARQL has aggregates now, at least with <a href="http://jena.sourceforge.net/" id="link-id11a25ac0">Jena</a> and <a href="http://virtuoso.openlinksw.com" id="link-id0xa83f490">Virtuoso</a>, so let&#39;s use these.  The BSBM business intelligence metric should be a separate metric off the same data.  Adding synthetic sales figures would make more interesting queries possible.  For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot;</p>
</li>

<li>
  <p>
    <b>For the SPARQL community</b>, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id109e2448">SPARQL protocol</a> extension; the SPARUL syntax should also have a way of calling a procedure.  Something like <code>select proc (??, ??)</code> would be enough, where <code>??</code> is a parameter marker, like <code>?</code> in <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id13febf48">ODBC</a>/<a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id120416a8">JDBC</a>.</p>
</li>

<li>
  <p>
    <b>Add transactions.</b>  Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store.  This could use stored procedures or logic in an app server.</p>
</li>
</ul>
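<p>To make the point about parameter markers concrete, here is a minimal sketch of the SQL-side idiom that the suggestion borrows from, using Python&#39;s standard <code>sqlite3</code> module and its <code>?</code> placeholders. This is an analogy only: it is SQL rather than SPARQL, and the <code>product</code> table, its rows, and the function name are hypothetical.</p>
<pre>
```python
import sqlite3

# Hypothetical in-memory table standing in for a product catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT, price REAL)")
conn.executemany(
    "INSERT INTO product (id, label, price) VALUES (?, ?, ?)",
    [(1, "red widget", 9.5), (2, "blue widget", 12.0), (3, "green gadget", 30.0)],
)

# The statement text stays constant; only the bound values change between calls.
QUERY = "SELECT id, label FROM product WHERE price BETWEEN ? AND ? ORDER BY id"


def products_in_range(low, high):
    """Run the fixed statement with a fresh pair of bound parameters."""
    return conn.execute(QUERY, (low, high)).fetchall()


print(products_in_range(5, 15))
print(products_in_range(20, 40))
```
</pre>
<p>Because the statement text never changes, a driver or server is free to compile it once and reuse the result, which is the saving that a parameterized SPARQL protocol would make available.</p>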

<h3>Comments on Query Mix</h3>

<p>The time of most queries is less than linear to the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.</p>
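<p>The difference can be sketched in a few lines of Python. This is a toy illustration of scanning versus an inverted word index, not Virtuoso&#39;s text index; the labels and function names are hypothetical.</p>
<pre>
```python
from collections import defaultdict

# Hypothetical product labels; real BSBM labels come from the data generator.
labels = {
    1: "compact wireless keyboard",
    2: "wireless optical mouse",
    3: "mechanical keyboard with backlight",
    4: "usb optical drive",
}


def scan(words):
    """Without an index: examine every label, so the cost grows with the data size."""
    wanted = set(words)
    return sorted(pid for pid, label in labels.items()
                  if wanted.intersection(label.split()))


# With a text index: one posting set per word, built once at load time.
index = defaultdict(set)
for pid, label in labels.items():
    for word in label.split():
        index[word].add(pid)


def lookup(words):
    """With the index: touch only the postings of the requested words."""
    hits = set()
    for word in words:
        hits.update(index.get(word, ()))
    return sorted(hits)


print(scan(["keyboard", "mouse", "tablet"]))
print(lookup(["keyboard", "mouse", "tablet"]))
```
</pre>
<p>Both functions return the same products, but only <code>scan</code> has to visit every label, which is why the unindexed form of Q6 grows with the scale factor.</p>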

<h3>Next</h3>

<p>We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  This also includes all the query optimization work done for BSBM.  This will be available in the coming days.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1374">
  <rss:title>ESWC 2008</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T13:49:15Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Yrjänä Rankka and I attended ESWC2008 on behalf of OpenLink. We were invited at the last minute to give a Linked Open Data talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of SPARQL BI (PPT; other formats coming soon), our business intelligence extensions to SPARQL as well as joining between relational data mapped to RDF and native RDF data. I also spoke on the social networks panel chaired by Harry Halpin. I have gathered a few impressions that I will share in the next few posts (1 - RDF Mapping, 2 - DARQ, 3 - voiD, 4 - Paradigmata). Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant. The voiD Graph Linked Open Data was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for LOD this time around is voiD, the Vocabulary of Interlinked Datasets, a means to describe what in fact is inside the LOD cloud, how to join it with what and so forth. Big time important if there is to be a web of federatable data sources, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of Linked Data Planet in a couple of weeks. Federating Bastian Quilitz gave a talk about his DARQ, a federated version of Jena&#39;s ARQ. Something like DARQ&#39;s optimization statistics should make their way into the SPARQL protocol as well as the voiD data set description. We really need federation but more on this in a separate post. XSPARQL Axel Polleres et al had a paper about XSPARQL, a merge of XQuery and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. 
It is evident that the engines must be in the same process and not communicate via the SPARQL protocol for this to be practical. We could do this. We&#39;ll have to see when. Politically, using XQuery to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow. Panel: Will the Sem Web Rise to the Challenge of the Social Web? The social web panel presented the question of whether the sem web was ready for prime time with data portability. The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our semantic web, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot; So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as OAuth and OpenID. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything. I said that triples and vocabularies existed, that OpenLink had ODS (OpenLink Data Spaces, Community LinkedData) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. 
Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time. The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data. So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations! As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, information overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services. Identity and Reference For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by nominibus suis, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources. For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. 
So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application. Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an open world. For an app-specific warehouse, sameAs can be resolved at load time. There was naturally some apparent tension between the Occam camp of entity name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful. CYC and NLP The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of knowledge representation. Over the long term, I would have support of the CYC inference language inside a database query processor. This would mostly be for repurposing the huge knowledge base for helping in search type queries. If it is for transactions or financial reporting, then queries will be SQL and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see. I will also have a closer look at natural language one of these days, quite inevitably, since Zitgist (for example) is into entity disambiguation. Scale Garlic gave a talk about their Data Patrol and QDOS. 
We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive. Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and Sindice/Zitgist as triples. Conclusions There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then. For the Virtuoso roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At Linked Data Planet, we will make more announcements.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Yrjänä Rankka and I attended <a href="http://www.eswc2008.org/" id="link-id10b7a038">ESWC2008</a> on behalf of OpenLink.</p>
<p>We were invited at the last minute to give a <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id105df758">Linked Open Data</a> talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of <a href="http://dbpedia.org/resource/SPARQL" id="link-id12eacca0">SPARQL</a> BI (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/ESWC2008%20SPARQL%20BI%20OpenLink.ppt" id="link-id10b43e58">PPT</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id1116d8f0">other formats coming soon</a>), our business intelligence extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1843a368">SPARQL</a> as well as joining between relational <a href="http://dbpedia.org/resource/Data" id="link-id10badc40">data</a> mapped to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id108edaf8">RDF</a> and native <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1843a3b0">RDF</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x1843a3c8">data</a>. I also spoke on the social networks panel chaired by Harry Halpin.</p>
<p>I have gathered a few impressions that I will share in the next few posts (<a href="http://www.openlinksw.com/weblog/oerling/?id=1375" id="link-id107298e0">1 - RDF Mapping</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id10b3a530">2 - DARQ</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id107290e0">3 - voiD</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1378" id="link-id1071a950">4 - Paradigmata</a>). <i>Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.</i>
</p>
<h2>The voiD Graph</h2>
<p>
<a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x16c781e0">Linked Open Data</a> was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id108f3c48">LOD</a> this time around is <a href="http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08" id="link-id10df9830">voiD, the Vocabulary of Interlinked Datasets</a>, a means to describe what in fact is inside the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x16c78228">LOD</a> cloud, how to join it with what and so forth. Big time important if there is to be a <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-iddf74578">web of federatable data sources</a>, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of <a href="http://www.linkeddataplanet.com/" id="link-id10dd0848">Linked Data Planet</a> in a couple of weeks.</p>
<h2>Federating</h2>
<p>Bastian Quilitz gave a talk about his <a href="http://darq.sourceforge.net/" id="link-id108746e8">DARQ</a>, a federated version of Jena&#39;s ARQ.</p>
<p>Something like <a href="http://darq.sourceforge.net/" id="link-id0x16c782e8">DARQ</a>&#39;s optimization statistics should make their way into the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id10992348">SPARQL protocol</a> as well as the voiD data set description.</p>
<p>We really need federation but more on this in <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id1059d688">a separate post</a>.</p>
<h2>
<a href="http://xsparql.deri.ie/" id="link-id10314308">XSPARQL</a>
</h2>
<p>Axel Polleres et al had a paper about <a href="http://xsparql.deri.ie/" id="link-id0x1a2d8458">XSPARQL</a>, a merge of <a href="http://dbpedia.org/resource/XQuery" id="link-id10b98e90">XQuery</a> and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1d99c1d0">SPARQL protocol</a> for this to be practical. We could do this. We&#39;ll have to see when.</p>
<p>Politically, using <a href="http://dbpedia.org/resource/XQuery" id="link-id0x1acae1f0">XQuery</a> to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.</p>
<h2>Panel: Will the Sem Web Rise to the Challenge of the Social Web?</h2>
<p>The social web panel presented the question of whether the sem web was ready for prime time with data portability.</p>
<p>The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id122f4da0">semantic web</a>, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot;</p>
<p>So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as <a href="http://dbpedia.org/page/OAuth" id="link-idf300bc0">OAuth</a> and <a href="http://dbpedia.org/page/OpenID" id="link-id10ce7a40">OpenID</a>. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.</p>
<p>I said that triples and vocabularies existed, that OpenLink had <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id1210dbf8">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id11076be8">OpenLink Data Spaces</a>, <a href="http://community.linkeddata.org/" id="link-id10d46710">Community LinkedData</a>) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time.</p>
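<p>The rule of thumb above works out as follows. This is a back-of-the-envelope sketch in Python that uses only the figures quoted in the paragraph; the variable names are ours, and the per-unit costs are simple divisions of the rough $400K estimate, not price quotes.</p>
<pre>
```python
GTRIPLES = 100            # target scale mentioned in the text
RAM_GB_PER_GTRIPLE = 16   # rule of thumb from the text
BUDGET_USD = 400_000      # rough hardware estimate from the text, without replication

# Aggregate memory needed so that queries are not waiting for disk all the time.
total_ram_gb = GTRIPLES * RAM_GB_PER_GTRIPLE

# Hardware spend implied per Gtriple and per GB of RAM hosted.
usd_per_gtriple = BUDGET_USD / GTRIPLES
usd_per_gb_ram = BUDGET_USD / total_ram_gb

print(f"total RAM: {total_ram_gb} GB")
print(f"hardware per Gtriple: ${usd_per_gtriple:,.0f}")
print(f"hardware per GB of RAM: ${usd_per_gb_ram:,.0f}")
```
</pre>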
<p>The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.</p>
<p>So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!</p>
<p>As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, <a href="http://dbpedia.org/resource/Information" id="link-id108b7688">information</a> overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.</p>
<h2>Identity and Reference</h2>
<p>For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by <i>nominibus suis</i>, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.</p>
<p>For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.</p>
<p>Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id10c4dfd0">open world</a>. For an app-specific warehouse, sameAs can be resolved at load time.</p>
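<p>One way to picture application-chosen sameAs at query time is the following Python sketch. It is an illustration under our own assumptions, not how Virtuoso implements inference: the stored triples are never rewritten, and each query supplies the set of sameAs pairs it is willing to trust. All URIs, data, and function names are hypothetical.</p>
<pre>
```python
def same_as_closure(pairs):
    """Group URIs into equivalence classes from symmetric, transitive sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = {}
    for uri in list(parent):
        classes.setdefault(find(uri), set()).add(uri)
    return {uri: members for members in classes.values() for uri in members}


def query_objects(triples, subject, predicate, same_as_pairs):
    """Answer (subject, predicate, ?o), expanding the subject through the chosen sameAs set."""
    equivalents = same_as_closure(same_as_pairs).get(subject, {subject})
    return sorted(o for s, p, o in triples if s in equivalents and p == predicate)


# Hypothetical data: nothing about sameAs is materialized in the store itself.
triples = [
    ("ex:NaCl", "ex:formula", "NaCl"),
    ("ex:table_salt", "ex:usedIn", "ex:soup"),
]
cooking_same_as = [("ex:table_salt", "ex:NaCl")]  # close enough for a recipe site
chemistry_same_as = []                            # too coarse for chemistry

print(query_objects(triples, "ex:table_salt", "ex:formula", cooking_same_as))
print(query_objects(triples, "ex:table_salt", "ex:formula", chemistry_same_as))
```
</pre>
<p>Dropping a spam or malicious assertion then means removing a pair from the set passed in with the query, with no stored data to repair.</p>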
<p>There was naturally some apparent tension between the Occam camp of <a href="http://dbpedia.org/resource/Entity" id="link-id105fd240">entity</a> name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.</p>
<h2>CYC and NLP</h2>
<p>The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of <a href="http://dbpedia.org/resource/Knowledge" id="link-id10568158">knowledge</a> representation. Over the long term, I would like to have support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x17f7dd40">knowledge</a> base for helping in search type queries. If it is for transactions or financial reporting, then queries will be <a href="http://dbpedia.org/resource/SQL" id="link-id130a0a80">SQL</a> and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see.</p>
<p>I will also have a closer look at natural language one of these days, quite inevitably, since <a href="http://zitgist.com/about/" id="link-id10795828">Zitgist</a> (for example) is into <a href="http://dbpedia.org/resource/Entity" id="link-id0x1a2c8bd0">entity</a> disambiguation.</p>
<h2>Scale</h2>
<p>Garlic gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.</p>
<p>Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and <a href="http://sindice.org/" id="link-id0x1b383720">Sindice</a>/<a href="http://zitgist.com/about/" id="link-id0x1b383738">Zitgist</a> as triples.</p>
<h2>Conclusions</h2>
<p>There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.</p>
<p>For the <a href="http://virtuoso.openlinksw.com" id="link-id1026ac28">Virtuoso</a> roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At <a href="http://www.linkeddataplanet.com/" id="link-id0x1a2c7eb0">Linked Data Planet</a>, we will make more announcements.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-03-06#1321">
  <rss:title>TPC H as Linked Data (Updated 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-03-06T16:22:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have a new demo online at http://demo.openlinksw.com/tpc-h. This takes the industry standard TPC-H benchmark data and presents it as linked data with a SPARQL end point and dereferenceable URIs. This is an example of using Virtuoso&#39;s relational-to-RDF mapping for publishing business data, for browsing using the linked data principles and opening it to analytics queries in SPARQL. As noted before, we have extended SPARQL with aggregation and nested queries, thus making it a viable SQL substitute for decision support queries. The article at http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData gives details and the source code for the implementation. We are still working on some aspects of the more complex TPC-H queries, thus the demo is not complete with all the 22 queries. This is however enough to see a representative sample of how analytics queries work with SPARQL and Virtuoso&#39;s SQL-to-RDF mapping. The demo will be part of the next Virtuoso Open Source download, probably out next week.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have a new demo online at <a href="http://demo.openlinksw.com/tpc-h" id="link-id1829c9a0">http://demo.openlinksw.com/tpc-h</a>. This takes the industry standard <a href="http://dbpedia.org/resource/TPC-H" id="link-id0xeb7e460">TPC-H</a> benchmark <a href="http://dbpedia.org/resource/Data" id="link-id0xb40fcb8">data</a> and presents it as <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x9edbd128">linked data</a> with a <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xf566a50">SPARQL</a> end point and dereferenceable URIs. </p>
<p>This is an example of using <a href="http://virtuoso.openlinksw.com" id="link-id0x11e59f80">Virtuoso</a>&#39;s relational-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xfc93c70">RDF</a> mapping for publishing business data, for browsing using the linked data principles and opening it to analytics queries in SPARQL.</p>
<p> As noted before, we have extended SPARQL with aggregation and nested queries, thus making it a viable <a href="http://dbpedia.org/resource/SQL" id="link-id0xffe4520">SQL</a> substitute for decision support queries. </p>
<p>The article at <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData" id="link-id10799d10">http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData</a> gives details and the source code for the implementation.</p>
<p> We are still working on some aspects of the more complex TPC-H queries, thus the demo is not complete with all the 22 queries. This is however enough to see a representative sample of how analytics queries work with SPARQL and Virtuoso&#39;s SQL-to-RDF mapping. The demo will be part of the next Virtuoso Open Source download, probably out next week.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-02-04#1308">
  <rss:title>LUBM results with Virtuoso 6.0</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-02-04T09:58:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have now run the LUBM benchmark on Virtuoso v6, with the same configuration as discussed last Friday. We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities (same data, but different sizes of working set). 100 universities: 35.3 qps 1000 universities: 26.3 qps 8000 universities: 13.1 qps The 100 universities slice is about the same as with v5.0.5 (35.3 vs. 33.1 qps). The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps). This comes from the fact that the v6 database takes half of the space of the v5.0.5 one. Further, this is with 64-bit IDs for everything. If the v5.0.5 database used 64-bit IDs as well, we&#39;d have a difference of over 3x. This is worth something if it lets you get by with only 1 terabyte of RAM for the 100 billion triple application, instead of 3 TB. In a few more days, we&#39;ll give the results for Virtuoso v6 Cluster.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have now run the LUBM benchmark on <a href="http://virtuoso.openlinksw.com" id="link-id0x1a6cb3c8">Virtuoso</a> v6, with the same configuration <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1302" id="link-id107f0238">as discussed last Friday</a>.</p>
<p>We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities — same <a href="http://dbpedia.org/resource/Data" id="link-id0x12ac6cc8">data</a> but different sizes of working set.</p>
<blockquote>
<pre>
 100 universities: 35.3 qps
1000 universities: 26.3 qps
8000 universities: 13.1 qps</pre></blockquote>
<p>The 100 universities slice is about the same as with v5.0.5 (35.3 vs. 33.1 qps). <br />The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps).</p>
<p>This comes from the fact that the v6 database takes half of the space of the v5.0.5 one. Further, this is with 64-bit IDs for everything. If the v5.0.5 database used 64-bit IDs as well, we&#39;d have a difference of over 3x. This is worth something if it lets you get by with only 1 terabyte of RAM for the 100 billion triple application, instead of 3 TB.</p>
<p>
<a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1358" id="link-id15fb4d38">In a few more days</a>, we&#39;ll give the results for Virtuoso v6 Cluster.</p>

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1304">
  <rss:title>Latest LUBM Benchmark results for Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-02-01T14:39:04Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have now taken a close look at the query side of the LUBM benchmark, as promised a couple of blog posts ago. We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database. When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130. We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second. The article LUBM and Virtuoso gives the details. In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things. Now, what has been said about RDF benchmarking previously still holds. What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF data? Or about information integration? Not very much. But then this was not the aim of the authors either. So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there. In the immediate future, we have the general availability of Virtuoso Open Source 5.0.5 early next week. 
This comes with a LUBM test driver and a test suite running against the LUBM qualification database. After this we will give some numbers for the cluster edition with LUBM and TPC-H.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have now taken a close look at the query side of the LUBM benchmark, <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1296" id="link-id10a98120">as promised a couple of blog posts ago.</a>
</p>
<p>We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.</p>
<p>When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130.</p>
<p>We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second.</p>
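<p>As a rough sanity check on these figures, the arithmetic can be spelled out as below. This is only a sketch: the variable names are illustrative, and relating the per-core triple rate to the raw lookup rate is our own reading of the numbers, not a separately measured result.</p>

```python
# Figures quoted in the text above; the variable names are illustrative.
triples_per_second = 330_000        # triples retrieved at the 33 qps rate
queries_per_second = 33             # query mix throughput, all in memory
cores = 4                           # 2 x Xeon 5130, i.e. 4 cores at 2GHz
random_lookups_per_core = 250_000   # raw G, S, P, O lookups per core per second

# Average number of triples touched by one query of the mix.
triples_per_query = triples_per_second / queries_per_second

# Triples retrieved per core per second while running the full query logic.
triples_per_core = triples_per_second / cores

# How the query workload compares with the raw per-core lookup rate.
share_of_lookup_rate = triples_per_core / random_lookups_per_core

print(f"{triples_per_query:.0f} triples per query")
print(f"{triples_per_core:.0f} triples per core per second")
print(f"{share_of_lookup_rate:.0%} of the raw per-core lookup rate")
```

<p>Under these assumptions, a query of the mix retrieves about 10000 triples on average, and each core delivers about a third of its raw random lookup rate once query logic is included.</p>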
<p>The article <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSArticleLUBMBenchmark" id="link-id10237708">LUBM and Virtuoso</a> gives the details.</p>
<p>In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.</p>
<p>Now, what has been said about <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x104257c0">RDF</a> benchmarking previously still holds. What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF <a href="http://dbpedia.org/resource/Data" id="link-id0x7376478">data</a>? Or about <a href="http://dbpedia.org/resource/Information" id="link-id0x13fd3f30">information</a> integration? Not very much. But then this was not the aim of the authors either.</p>
<p>So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.</p>
<p>In the immediate future, we have the general availability of <a href="http://virtuoso.openlinksw.com" id="link-id0x193509e8">Virtuoso</a> Open Source 5.0.5 early next week. This comes with a LUBM test driver and a test suite running against the LUBM qualification database.</p>
<p>After this we will give some numbers for the cluster edition with LUBM and <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1b8d1348">TPC-H</a>.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-11-08#1269">
  <rss:title>Social Web RDF Store Benchmark</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-11-08T13:39:39Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Elaborating on my previous post, as food for thought for an RDF store benchmarking activity under the W3C, I present the following rough sketch. At the end of the sketch below, I propose some common business questions that should be answered by a social web aggregator. The problem with these is that it is not really possible to ask interesting questions over a large database without involving some sort of counting and grouping. I feel that we simply cannot make a representative benchmark without these, quite regardless of the fact that SPARQL in its present form does not have these features. Hence I have simply stated the questions and left any implementation open. If this seems like an interesting direction, the nascent W3C benchmarking XG (experimental group) can refine the business questions, relative query frequencies, exact data set composition, etc. Social Web RDF Benchmark by Orri Erling Goals This benchmark models the use of RDF for representing and analyzing use of social software by user communities. The benchmark consists of a scalable synthetic data set, a feed of updates to the data set, and a query mix. The data set reflects the common characteristics of the social web, with realistic distribution of connections, user contributed content, commenting, tagging, and other social web activities. The data set is expressed in the FOAF and SIOC vocabularies. The query mix is divided between relatively short, dashboard or search engine style lookups, and longer running analytics queries. The system being modeled is an aggregator of social web content; we could liken it to an RDF-based Technorati with some extra features. Users can publish their favorite queries or mash-ups as logical views served by the system. In this manner, queries come to depend on other queries, somewhat like SQL VIEWs can reference each other. 
There is a small qualification data set that can be tested against the queries to validate that the system under test (SUT) produces the correct results. The benchmark is scaled by number of users. To facilitate comparison, some predefined scales are offered, i.e., 100K, 300K, 1M, 3M, 10M users. Each simulated user both produces and consumes content. The level of activity of users is unevenly divided. There are two work mixes: the browsing mix, which consists of a mix of lookups and contributing content, and the analytics mix, which consists of long-running queries for tracking the state of the network. For each 100 browsing mixes, one analytics mix is performed. A benchmark run is at least 1h real-time in duration. The metric is calculated by the number of browsing mixes completed during the test window. This simulates 10% of the users being online at any one time, thus for a scale of 1M users, 100K browsing mixes will be simultaneously proceeding. The test driver submits the work via HTTP. What load balancing or degree of parallel serving of the requests is used is left up to the SUT. The metric is expressed as queries per second, taking the total number of queries executed by completed browsing mixes and dividing this by the real time of the measurement window. The metric is called qpsSW, for queries per second, socialweb. The cost metric is $/qpsSW, calculated by the costing rules of the TPC. If compute-on-demand infrastructure is used, the costing will be $/qpsSW/day. The test sponsor is the party contributing the result. The contribution consists of the metric and of a full disclosure report (FDR), written following a template given in the benchmark specification. The disclosure requirements follow the TPC practices, including publishing any configuration scripts, data definition language statements, timing for warm-up and test window, times for individual queries etc. All details of the hardware and software are disclosed. 
Test Support Software The software consists of the data generator and of a test driver. The test driver calls functions supplied by the test sponsor for performing the diverse operations in the test. Source code for any modifications of the test driver is to be published as part of the FDR. Rules for SUT Any hardware/software combination, including single machines, clusters, and clusters rented from computer providers like Amazon EC2, is eligible. The SUT must produce correct answers for the validation queries against the validation data set. The implementation of the queries is not restricted. These can be any SPARQL or other queries, application server based logic, stored procedures or other, in any language, provided full source code is provided in the FDR. The data set is provided as serialized RDF. The means of storage are left up to the SUT. The basic intention is to use a triple store of some form, but the specific indexing, use of property tables, materialized views, and so forth, is left up to the test sponsor. All tuning and configuration is to be published in the FDR. Simulated Workload For each operation of each mix, the specification shall present: The logical intent of the operation, the business question, e.g., What is the hot topic among my friends? The question or update expressed in terms of the data in the data set. Sample text of a query answering the question or pseudo-code for deriving the answer. Result set layout, if applicable. The relative frequencies of the queries are given in the query mix summary. Browsing Mix The browsing mix consists of the following operations: Updates Make a blog post. Make a blog comment. Make a new social contact. For one new social contact, there are 10 posts and 20 comments. Queries What are the 10 most recent posts by somebody in my friends or their friends? This would be a typical dashboard item. What are the authoritative bloggers on topic x? This is a moderately complex ad-hoc query. 
Take posts tagged with the topic, count links to them, take the blogs containing them, show the 10 most cited blogs with the most recent posts with the tag. This would be typical of a stored query, like a parameterizable report. How do I contact person x? Calculate the chain of common acquaintances best for reaching person x. For practicality, we do not do a full walk of anything but just take the distinct persons in 2 steps of the user and in 2 steps of x and see the intersection. Who are the people like me? Find the top 10 people ranked by count of tags in common in the person&#39;s tag cloud. The tag cloud is the set of interests and the set of tags in blog posts of the person. Who reacts to or talks about me? Count of replies to material by the user, grouped by the commenting user and the site of the comment, top 20, sorted by count descending. Who are my fans that I do not know? Same as above, excluding people within 2 steps. Who are my competitors? Most prolific posters on topics of my interest that do not cite me. Where is the action? On forums where I participate, what are the top 5 threads, as measured by posts in the last day. Show count of posts in the last day and the day before that. How do I get there? Who are the people active around both topic x and y? This is defined by a person having participated during the last year in forums of x as well as of y. Forums are tagged by topics. The most active users are first. The ranking is proportional to the sum of the number of posts in x and y. Analytic Mix These queries are typical questions about the state of the conversation space as a whole and can for example be published as a weekly summary page. The fastest propagating idea - What is the topic with the most users who have joined in the last day? A user is considered to have joined if the user was not discussing this in the past 10 days. Prime movers - What users start conversations? A conversation is the set of material in reply to or citing a post. 
The reply distance can be arbitrarily long; the citing distance is a direct link to the original post or to a reply to it. The number and extent of conversations contribute towards the score. Geography - Over the last 10 days, for each geographic area, show the top 50 tags. The location is the location of the poster. Social hubs - For each community, get the top 5 people who are central to it in terms of number of links to other members of the same community and in terms of being linked from posts. A community is the set of forums that have a specific topic.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Elaborating on <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1269" id="link-idfe9e1d8">my previous post</a>, as food for thought for an <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1d1e1468">RDF</a> store benchmarking activity under the W3C, I present the following rough sketch. At the end of the sketch below, I propose some common business questions that should be answered by a social web aggregator.</p>
<p>The problem with these is that it is not really possible to ask interesting questions over a large database without involving some sort of counting and grouping. I feel that we simply cannot make a representative benchmark without these, quite regardless of the fact that <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xba84830">SPARQL</a> in its present form does not have these features. Hence I have simply stated the questions and left any implementation open. If this seems like an interesting direction, the nascent W3C benchmarking XG (experimental group) can refine the business questions, relative query frequencies, exact <a href="http://dbpedia.org/resource/Data" id="link-id0x1c272b10">data</a> set composition, etc.</p>
<h3>Social Web RDF Benchmark </h3>
<p>
<i>by Orri Erling</i>
</p>
<h4>Goals</h4>
<p>This benchmark models the use of RDF for representing and analyzing use of social software by user communities. The benchmark consists of a scalable synthetic data set, a feed of updates to the data set, and a query mix. The data set reflects the common characteristics of the social web, with realistic distribution of connections, user contributed content, commenting, tagging, and other social web activities. The data set is expressed in the FOAF and SIOC vocabularies. The query mix is divided between relatively short, dashboard or search engine style lookups, and longer running analytics queries.</p>
<p>The system being modeled is an aggregator of social web content; we could liken it to an RDF-based Technorati with some extra features.</p>
<p>Users can publish their favorite queries or mash-ups as logical views served by the system. In this manner, queries come to depend on other queries, somewhat like <a href="http://dbpedia.org/resource/SQL" id="link-id0xb75c930">SQL</a> VIEWs can reference each other.</p>
<p>There is a small qualification data set that can be tested against the queries to validate that the system under test (SUT) produces the correct results.</p>
<p>The benchmark is scaled by number of users. To facilitate comparison, some predefined scales are offered, i.e., 100K, 300K, 1M, 3M, 10M users. Each simulated user both produces and consumes content. The level of activity of users is unevenly divided.</p>
<p>There are two work mixes: the browsing mix, which consists of a mix of lookups and contributing content, and the analytics mix, which consists of long-running queries for tracking the state of the network. For each 100 browsing mixes, one analytics mix is performed.</p>
<p>A benchmark run is at least 1h real-time in duration. The metric is calculated by the number of browsing mixes completed during the test window. This simulates 10% of the users being online at any one time, thus for a scale of 1M users, 100K browsing mixes will be simultaneously proceeding.</p>
<p>The test driver submits the work via <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1ae7c010">HTTP</a>. What load balancing or degree of parallel serving of the requests is used is left up to the SUT.</p>
<p>The metric is expressed as queries per second, taking the total number of queries executed by completed browsing mixes and dividing this by the real time of the measurement window. The metric is called qpsSW, for <i>queries per second, socialweb</i>. The cost metric is $/qpsSW, calculated by the costing rules of the TPC. If compute-on-demand infrastructure is used, the costing will be $/qpsSW/day.</p>
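<p>The metric definitions above can be sketched in a few lines. This is a minimal illustration, not part of any benchmark kit: the function names and the example figures in the usage note are hypothetical, and the TPC costing rules themselves are not modeled, only the final division.</p>

```python
MIN_WINDOW_SECONDS = 3600  # a benchmark run lasts at least 1h of real time


def qps_sw(queries_in_completed_mixes, window_seconds):
    """qpsSW: queries executed by completed browsing mixes per second of real time."""
    if MIN_WINDOW_SECONDS > window_seconds:
        raise ValueError("the measurement window must be at least one hour")
    return queries_in_completed_mixes / window_seconds


def cost_per_qps_sw(total_cost_dollars, qps):
    """$/qpsSW: priced cost of the SUT divided by the throughput metric."""
    return total_cost_dollars / qps


def concurrent_browsing_mixes(users, online_percent=10):
    """Browsing mixes in progress when online_percent of the users are online."""
    return users * online_percent // 100
```

<p>For example, with made-up figures, <code>qps_sw(3_600_000, 3600)</code> gives 1000 qpsSW, <code>cost_per_qps_sw(50_000, 1000)</code> gives 50 $/qpsSW, and <code>concurrent_browsing_mixes(1_000_000)</code> gives the 100K simultaneous browsing mixes mentioned for the 1M user scale.</p>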
<p>The test sponsor is the party contributing the result. The contribution consists of the metric and of a full disclosure report (FDR), written following a template given in the benchmark specification. The disclosure requirements follow the TPC practices, including publishing any configuration scripts, data definition language statements, timing for warm-up and test window, times for individual queries etc. All details of the hardware and software are disclosed.</p>
<h4>Test Support Software</h4>
<p>The software consists of the data generator and of a test driver. The test driver calls functions supplied by the test sponsor for performing the diverse operations in the test. Source code for any modifications of the test driver is to be published as part of the FDR.</p>
<h4>Rules for SUT</h4>
<p>Any hardware/software combination, including single machines, clusters, and clusters rented from computer providers like Amazon EC2, is eligible.</p>
<p>The SUT must produce correct answers for the validation queries against the validation data set.</p>
<p>The implementation of the queries is not restricted. These can be any SPARQL or other queries, <a href="http://dbpedia.org/resource/Application_server" id="link-id0x1a38aee0">application server</a> based logic, stored procedures or other, in any language, provided full source code is provided in the FDR.</p>
<p>The data set is provided as serialized RDF. The means of storage are left up to the SUT. The basic intention is to use a triple store of some form, but the specific indexing, use of property tables, materialized views, and so forth, is left up to the test sponsor. All tuning and configuration is to be published in the FDR.</p>
<h4>Simulated Workload</h4>
<p>For each operation of each mix, the specification shall present:</p>
<ol>
 <li>
  <p>The logical intent of the operation, the business question, e.g., <i>What is the hot topic among my friends?</i>
  </p>
</li>
<li>
  <p>The question or update expressed in terms of the data in the data set.</p>
</li>
<li>
  <p>Sample text of a query answering the question or pseudo-code for deriving the answer.</p>
</li>
<li>
  <p>Result set layout, if applicable.</p>
</li>
</ol>
<p>The relative frequencies of the queries are given in the query mix summary.</p>
<h4>Browsing Mix</h4>
<p>The browsing mix consists of the following operations:</p>
<h5>Updates</h5>
<p></p>
<ul>
<li>
  <p>Make a <a href="http://dbpedia.org/resource/Blog" id="link-id0x1e0f6470">blog</a> post.</p>
</li>
<li>
  <p>Make a blog comment.</p>
</li>
<li>
  <p>Make a new social contact.</p>
</li>
</ul>
<p>For one new social contact, there are 10 posts and 20 comments.</p>
<h5>Queries</h5>
<ul>
 <li>
  <p>
    <i>What are the 10 most recent posts by somebody in my friends or their friends?</i> This would be a typical dashboard item.</p>
 </li>
<li>
  <p>
    <i>What are the authoritative bloggers on topic x?</i> This is a moderately complex ad-hoc query. Take posts tagged with the topic, count links to them, take the blogs containing them, show the 10 most cited blogs with the most recent posts with the <a href="http://dbpedia.org/resource/Tag" id="link-id0xbf5ace8">tag</a>. This would be typical of a stored query, like a parameterizable report.</p>
</li>
<li>
  <p>
    <i>How do I contact person x?</i> Calculate the chain of common acquaintances best for reaching person x. For practicality, we do not do a full walk of anything but just take the distinct persons in 2 steps of the user and in 2 steps of x and see the intersection.</p>
</li>
<li>
  <p>
    <i>Who are the people like me?</i> Find the top 10 people ranked by count of tags in common in the person&#39;s tag cloud. The tag cloud is the set of interests and the set of tags in blog posts of the person.</p>
</li>
<li>
  <p>
    <i>Who reacts to or talks about me?</i> Count of replies to material by the user, grouped by the commenting user and the site of the comment, top 20, sorted by count descending.</p>
</li>
<li>
  <p>
    <i>Who are my fans that I do not know?</i> Same as above, excluding people within 2 steps.</p>
</li>
<li>
  <p>
    <i>Who are my competitors?</i> Most prolific posters on topics of my interest that do not cite me.</p>
</li>
<li>
  <p>
    <i>Where is the action?</i> On forums where I participate, what are the top 5 threads, as measured by posts in the last day. Show count of posts in the last day and the day before that.</p>
</li>
<li>
  <p>
    <i>How do I get there? Who are the people active around both topic x and y?</i> This is defined by a person having participated during the last year in forums of x as well as of y. Forums are tagged by topics. The most active users are first. The ranking is proportional to the sum of the number of posts in x and y.</p>
</li>
</ul>
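<p>The <i>How do I contact person x?</i> question above lends itself to a short sketch. The following is one possible reading, assuming the social graph is held as an in-memory adjacency mapping; the function names and the sample graph are hypothetical, and a real SUT would answer this with a query against its own store.</p>

```python
def within_two_steps(graph, person):
    """Distinct persons reachable from `person` in one or two steps."""
    first = set(graph.get(person, ()))
    second = set()
    for friend in first:
        second.update(graph.get(friend, ()))
    reachable = first.union(second)
    reachable.discard(person)  # a person is not their own acquaintance
    return reachable


def common_acquaintances(graph, user, x):
    """Persons within 2 steps of both `user` and `x`; no full graph walk is done."""
    return within_two_steps(graph, user).intersection(within_two_steps(graph, x))


# Hypothetical sample data: a simple chain of contacts.
sample_graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol", "erin"],
    "erin": ["dave"],
}

print(common_acquaintances(sample_graph, "alice", "erin"))
```

<p>In this sample, alice and erin are four steps apart, and the intersection of their two-step neighbourhoods is carol, who is the natural intermediary for making contact.</p>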
<h4>Analytic Mix</h4>
<p>These queries are typical questions about the state of the conversation space as a whole and can for example be published as a weekly summary page.</p>
<ul>
<li>
  <p>
    <b>The fastest propagating idea</b> - <i>What is the topic with the most users who have joined in the last day?</i> A user is considered to have joined if the user was not discussing this in the past 10 days.</p>
</li>
<li>
  <p>
    <b>Prime movers</b> - <i>What users start conversations?</i> A conversation is the set of material in reply to or citing a post. The reply distance can be arbitrarily long; the citing distance is a direct link to the original post or to a reply to it. The number and extent of conversations contribute towards the score.</p>
</li>
<li>
  <p>
    <b>Geography</b> - Over the last 10 days, for each geographic area, show the top 50 tags. The location is the location of the poster.</p>
</li>
<li>
  <p>
    <b>Social hubs</b> - For each community, get the top 5 people who are central to it in terms of number of links to other members of the same community and in terms of being linked from posts. A community is the set of forums that have a specific topic.</p>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-11-08#1268">
  <rss:title>RDBMS to RDF Mapping Workshop, and Benchmarks</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-11-08T12:51:44Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was recently in Boston for the Mapping Relational Data to RDF workshop of the W3C. The common feeling was that mapping everything to RDF and querying it in terms of a generic domain ontology, mapped on demand into whatever line of business systems, would be very good if it only could be done. However, since this is not so easily done, the next best is to extract the data and then warehouse it as RDF. The obstacles perceived were of the following types: Lack of quality in the data. The different line of business systems do not in and of themselves hold enough semantics. If the meaning of data columns in relational tables were really known and explicit, these could be meaningfully used for joining across systems. But this is more complex than just mapping the metal lead to the chemical symbol Pb and back. Lack of performance in RDF storage. Data sets even in the tens-of-millions of triples do not run very well in some stores. Well, we had the Banff life sciences demo with 450M triples in a small server box running Virtuoso, so this is not universal, plus of course we are coming up with a whole different order of magnitude, as often discussed on this blog. Lack of functionality in mapping and possibly lack of pushing through enough of the query processing to the underlying data stores. Personally, I am quite aware of what to do with regard to performance of mapping and storage, and see these as eminently solvable issues. After all, we have a great investment of talent in databases in general and it can be well deployed towards RDF, as we have been doing these past couple of years. So we talk about the promise of a 360-degree view of information, with RDF being the top layer. Everybody agrees that this is a nice concept. But this is a nice concept especially when it can do the things that are the most common baseline expectation of any regular DBMS, i.e., aggregation, grouping, sub-queries, VIEWs. 
Now, I would not go sell a DBMS that has no COUNT operator to a data warehousing shop. The fact that OpenLink and Oracle allow RDF inside SQL, and OpenLink even adds native aggregates and grouping to SPARQL, fixes the problem with regard to specific products, but leaves the standardization issue open. Of course, any vendor will solve these questions one way or another because a database with no aggregation is a non-starter. I talked to Lee Feigenbaum, chair of the W3C DAWG, about the question of aggregates and general BI capabilities in SPARQL. He told me that, prior to his time with the DAWG, these were left out because they conflicted with the open-world assumption around RDF: You cannot count a set because by definition you do not know that you have all the members, the world being open and all that. Say what? Talk about the road to hell being paved with good intentions. Now, this is in no way Lee&#39;s or the present day DAWG&#39;s fault; as a member myself, I can attest to the good work and would under no circumstances wish any delays or revisions to SPARQL at this point. I am just pointing out a matter that all implementations should address, as a sort of precondition of entry into the real world IS space. If this can be done interoperably, so much the better. Now, out of the deliberations at the Boston workshop arose at least two ideas for follow-up activity. The first was an incubator group for RDF store and mapping benchmarking. This is very appropriate in order to dispel the bad name RDF storage and querying performance has been saddled with. As a first step in this direction, I will outline a social web oriented benchmark on this blog. The second activity was an incubator group for preparing standardization of mapping methodologies from relational schemas to RDF. We will be active on this as well. The two offshoots appear logically separate but are not necessarily so in practice. 
A benchmark is after all something that is supposed to promote a technology to a user base. The user base seems to wish to put all online systems and data warehouses under a common top level RDF model and then query away, introducing no further replication of data or performance cost or ETL latencies. Updating would also be nice but even query only would be very good. Personally, I&#39;d say the RDF strength is all on the query side. Transactions are taken care of well enough by what there already is, RDF stands out in integration and the ad-hoc and discovery side of the matter. Given this, we expect the value to be consumed in a heterogeneous, multi-database, federated environment. Thus a benchmark should measure this aspect of the use-case. With the right mapping and queries, we could probably demonstrate the added cost of RDF to be very low, as long as we could push all queries that can be answered by a single source to the responsible DBMS. For distributed joins, we are back at the question of optimizing distributed queries but this is a familiar one and RDF is not the principal cost factor. The subject does become quite complex at this point. We would have to take supposedly representative synthetic OLTP and BI data sets (like the ones in TPC-D, TPC-E, and TPC-H), and invent queries across them that would both make sense and be implementable in SPARQL extended with aggregates and sub-queries. Reliance on SPARQL extensions is simply unavoidable. Setting up the test systems would be non-trivial, even though there is a lot of industry experience in these matters on the database side. So, while this is probably the benchmark most relevant to the target audience, we may have to start with a simpler one. I will next outline something to the effect.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was recently in Boston for the <a href="http://www.w3.org/2007/03/RdfRDB/" id="link-id10f990b0">Mapping Relational Data to RDF workshop</a> of the W3C.</p> 
<p>The common feeling was that mapping everything to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xb5de0b8">RDF</a> and querying it in terms of a generic domain ontology, mapped on demand into whatever line of business systems, would be very good if it only could be done. However, since this is not so easily done, the next best is to extract the <a href="http://dbpedia.org/resource/Data" id="link-id0xa05278e0">data</a> and then warehouse it as RDF.</p> 
<p>The obstacles perceived were of the following types:</p> 
<ul>
 <li>
  <p>Lack of quality in the data. The different line of business systems do not in and of themselves hold enough semantics. If the meaning of data columns in relational tables were really known and explicit, these could be meaningfully used for joining across systems. But this is more complex than just mapping the metal <i>lead</i> to the chemical symbol <i>Pb</i> and back.</p>
 </li>
<li>
  <p>Lack of performance in RDF storage. Data sets even in the tens-of-millions of triples do not run very well in some stores. Well, we had the Banff life sciences demo with 450M triples in a small server box running <a href="http://virtuoso.openlinksw.com" id="link-id0x1b47d218">Virtuoso</a>, so this is not universal, plus of course we are coming up with a whole different order of magnitude, as often discussed on this <a href="http://dbpedia.org/resource/Blog" id="link-id0xb6e3410">blog</a>.</p>
</li>
<li>
  <p>Lack of functionality in mapping and possibly lack of pushing through enough of the query processing to the underlying data stores.</p>
</li>
</ul>
<p>Personally, I am quite aware of what to do with regard to performance of mapping and storage, and see these as eminently solvable issues. After all, we have a great investment of talent in databases in general and it can be well deployed towards RDF, as we have been doing these past couple of years. So we talk about the promise of a 360-degree view of <a href="http://dbpedia.org/resource/Information" id="link-id0x1be90ae0">information</a>, with RDF being the top layer. Everybody agrees that this is a nice concept. But this is a nice concept especially when it can do the things that are the most common baseline expectation of any regular DBMS, i.e., aggregation, grouping, sub-queries, VIEWs. Now, I would not go sell a DBMS that has no <code>COUNT</code> operator to a data warehousing shop.</p> 
<p>The fact that OpenLink and <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1a882490">Oracle</a> allow RDF inside <a href="http://dbpedia.org/resource/SQL" id="link-id0x1c350498">SQL</a>, and OpenLink even adds native aggregates and grouping to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1a82e880">SPARQL</a>, fixes the problem with regard to specific products, but leaves the standardization issue open. Of course, any vendor will solve these questions one way or another because a database with no aggregation is a non-starter.</p> 
<p>I talked to Lee Feigenbaum, chair of the W3C DAWG, about the question of aggregates and general BI capabilities in SPARQL. He told me that, prior to his time with the DAWG, these were left out because they conflicted with the <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id0x9f7ce7d8">open-world</a> assumption around RDF: You cannot count a set because by definition you do not know that you have all the members, the world being open and all that.</p> 
<p>Say what? Talk about the road to hell being paved with good intentions. Now, this is in no way Lee&#39;s or the present day DAWG&#39;s fault; as a member myself, I can attest to the good work and would under no circumstances wish any delays or revisions to SPARQL at this point. I am just pointing out a matter that all implementations should address, as a sort of precondition of entry into the real world IS space. If this can be done interoperably, so much the better.</p> 
<p>Now, out of the deliberations at the Boston workshop arose at least two ideas for follow-up activity.</p> 
<p>The first was an incubator group for RDF store and mapping benchmarking. This is very appropriate in order to dispel the bad name RDF storage and querying performance has been saddled with. As a first step in this direction, I will outline a <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1269" id="link-id10306200">social web oriented benchmark</a> on this blog.</p> 
<p>The second activity was an <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id10150a58">incubator group for preparing standardization of mapping methodologies from relational schemas to RDF</a>. We will be active on this as well.</p> 
<p>The two offshoots appear logically separate but are not necessarily so in practice. A benchmark is after all something that is supposed to promote a technology to a user base. The user base seems to wish to put all online systems and data warehouses under a common top level RDF model and then query away, introducing no further replication of data or performance cost or ETL latencies.</p> 
<p>Updating would also be nice but even query only would be very good. Personally, I&#39;d say the RDF strength is all on the query side. Transactions are taken care of well enough by what there already is, RDF stands out in integration and the ad-hoc and discovery side of the matter. Given this, we expect the value to be consumed in a heterogeneous, multi-database, federated environment. Thus a benchmark should measure this aspect of the use-case. With the right mapping and queries, we could probably demonstrate the added cost of RDF to be very low, as long as we could push all queries that can be answered by a single source to the responsible DBMS. For distributed joins, we are back at the question of optimizing distributed queries but this is a familiar one and RDF is not the principal cost factor.</p> 
<p>The subject does become quite complex at this point. We would have to take supposedly representative synthetic OLTP and BI data sets (like the ones in TPC-D, TPC-E, and <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1de16b90">TPC-H</a>), and invent queries across them that would both make sense and be implementable in SPARQL extended with aggregates and sub-queries. Reliance on SPARQL extensions is simply unavoidable. Setting up the test systems would be non-trivial, even though there is a lot of industry experience in these matters on the database side.</p> 
<p>So, while this is probably the benchmark most relevant to the target audience, we may have to start with a simpler one. I will next <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1269" id="link-id10fa7a50">outline something to the effect</a>.</p> ]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-09-06#1250">
  <rss:title>Virtuoso Cluster Stage 1</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-09-06T10:38:31Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I recall a quote from a stock car racing movie. &quot;What is the necessary prerequisite for winning a race?&quot; asked the racing team boss. &quot;Being the fastest,&quot; answered the hotshot driver, after yet another wrecked engine. &quot;No. It is finishing the race.&quot; In the interest of finishing, we&#39;ll now leave optimizing the cluster traffic and scheduling and move to completing functionality. Our next stop is TPC-D. After this TPC-C, which adds the requirement of handling distributed deadlocks. After this we add RDF-specific optimizations. This will be Virtuoso 6 with the first stage of clustering support. This is with fixed partitions, which is just like a single database, except it runs on multiple machines. The stage after this is Virtuoso Cloud, the database with all the space filling properties of foam, expanding and contracting to keep an even data density as load and resource availability change. Right now, we have a pretty good idea of the final form of evaluating loop joins in a cluster, which after all is the main function of the thing. It makes sense to tune this to a point before going further. You want the pipes and pumps and turbines to have known properties and fittings before building a power plant. To test this, we took a table of a million short rows and made one copy partitioned over 4 databases and one copy with all rows in one database. We ran all the instances in a 4 core Xeon box. We used Unix sockets for communication. We joined the table to itself, like SELECT COUNT (*) FROM ct a, ct b WHERE b.row_no = a.row_no + 3. The + 3 causes the joined rows never to be on the same partition. With cluster, the single operation takes 3s and with a single process it takes 4s. The overall CPU time for cluster is about 30% higher, some of which is inevitable since it must combine results, serialize them, and so forth. 
Some real time is gained by doing multiple iterations of the inner loop (getting the row for b) in parallel. This can be further optimized to maybe 2x better with cluster but this can wait a little. Then we make a stream of 10 such queries. The stream with cluster is 14s; with the single process, it is 22s. Then we run 4 streams in parallel. The time with cluster is 39s and with a single process 36s. With 16 streams in parallel, cluster gets 2m51 and single process 3m21. The conclusion is that clustering overhead is not significant in a CPU-bound situation. Note that all the runs were at 4 cores at 98-100%, except for the first, single-client run, which had one process at 98% and 3 at 32%. The SMP single process loses by having more contention for mutexes serializing index access. Each wait carries an entirely ridiculous penalty of up to 6µs or so, as discussed earlier on this blog. The cluster wins by less contention due to distributed data and loses due to having to process messages and remember larger intermediate results. These balance out, or close enough. For the case with a single client, we can cut down on the coordination overhead by simply optimizing the code some more. This is quite possible, so we could get one process at 100% and 3 at 50%. The numbers are only relevant as ballpark figures and the percentages will vary between different queries. The point is to prove that we actually win and do not jump from the frying pan into the fire by splitting queries across processes. As a point of comparison, running the query clustered just as one would run it locally took 53s. We will later look at the effects of different networks, as we get to revisit the theme with some real benchmarks.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I recall a quote from a stock car racing movie.</p>
<p>&quot;What is the necessary prerequisite for winning a race?&quot; asked the racing team boss.</p>
<p>&quot;Being the fastest,&quot; answered the hotshot driver, after yet another wrecked engine.</p>
<p>&quot;No. It is finishing the race.&quot;</p>
<p>In the interest of finishing, we&#39;ll now leave optimizing the cluster traffic and scheduling and move to completing functionality. Our next stop is TPC-D. After this <a href="http://dbpedia.org/resource/TPC-C" id="link-id0x1b4d9898">TPC-C</a>, which adds the requirement of handling distributed deadlocks. After this we add <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xa57ade08">RDF</a>-specific optimizations.</p>
<p>This will be <a href="http://virtuoso.openlinksw.com" id="link-id0x9fc61850">Virtuoso</a> 6 with the first stage of clustering support. This is with fixed partitions, which is just like a single database, except it runs on multiple machines. The stage after this is Virtuoso Cloud, the database with all the space filling properties of foam, expanding and contracting to keep an even <a href="http://dbpedia.org/resource/Data" id="link-id0x1de20610">data</a> density as load and resource availability change.</p>
<p>Right now, we have a pretty good idea of the final form of evaluating loop joins in a cluster, which after all is the main function of the thing. It makes sense to tune this to a point before going further. You want the pipes and pumps and turbines to have known properties and fittings before building a power plant.</p>
<p>To test this, we took a table of a million short rows and made one copy partitioned over 4 databases and one copy with all rows in one database. We ran all the instances in a 4 core Xeon box. We used Unix sockets for communication.</p>
<p>We joined the table to itself, like <code><b>SELECT COUNT (*) FROM ct a, ct b WHERE b.row_no = a.row_no + 3</b></code>. The <code><b>+ 3</b></code> causes the joined rows never to be on the same partition.</p>
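<p>As a rough check of why the <code>+ 3</code> offset keeps joined rows apart, here is a minimal Python sketch. It assumes, purely for illustration, that rows are assigned to the 4 databases by <code>row_no</code> modulo 4; the actual partitioning function used in the test is not stated here, and the names <code>partition_of</code> and <code>crossing_joins</code> are hypothetical.</p>
<pre><code>
# Hypothetical partitioning: row_no modulo the number of databases.
# This is an assumption for illustration, not Virtuoso's actual hash.
N_PARTITIONS = 4


def partition_of(row_no):
    """Return the index of the partition assumed to hold row_no."""
    return row_no % N_PARTITIONS


def crossing_joins(n_rows, offset):
    """Count join pairs (a, a + offset) whose two rows sit on different partitions.

    Rows are numbered 0 .. n_rows - 1, so only the first n_rows - offset
    rows have a join partner.
    """
    return sum(
        1
        for a in range(n_rows - offset)
        if partition_of(a) != partition_of(a + offset)
    )
</code></pre>
<p>Under this assumption, every pair joined with an offset of 3 spans two partitions, whereas an offset of 4 would keep each pair on a single partition.</p>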
<p>With cluster, the single operation takes 3s and with a single process it takes 4s. The overall CPU time for cluster is about 30% higher, some of which is inevitable since it must combine results, serialize them, and so forth. Some real time is gained by doing multiple iterations of the inner loop (getting the row for b) in parallel. This can be further optimized to maybe 2x better with cluster but this can wait a little.</p>
<p>Then we make a stream of 10 such queries. The stream with cluster is 14s; with the single process, it is 22s. Then we run 4 streams in parallel. The time with cluster is 39s and with a single process 36s. With 16 streams in parallel, cluster gets 2m51 and single process 3m21.</p>
<p>The conclusion is that clustering overhead is not significant in a CPU-bound situation. Note that all the runs were at 4 cores at 98-100%, except for the first, single-client run, which had one process at 98% and 3 at 32%.</p>
<p>The SMP single process loses by having more contention for mutexes serializing index access. Each wait carries an entirely ridiculous penalty of up to 6µs or so, <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1229" id="link-id106dca10">as discussed earlier on this blog</a>. The cluster wins by less contention due to distributed data and loses due to having to process messages and remember larger intermediate results. These balance out, or close enough.</p>
<p>For the case with a single client, we can cut down on the coordination overhead by simply optimizing the code some more. This is quite possible, so we could get one process at 100% and 3 at 50%.</p>
<p>The numbers are only relevant as ballpark figures and the percentages will vary between different queries. The point is to prove that we actually win and do not jump from the frying pan into the fire by splitting queries across processes. As a point of comparison, running the query clustered just as one would run it locally took 53s.</p>
<p>We will later look at <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1336" id="link-id108d9868">the effects of different networks</a>, as we get to revisit the theme with some real benchmarks.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-08-28#1246">
  <rss:title>Virtuoso and cluster capacity allocation</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-08-28T10:08:25Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I just read Google&#39;s Bigtable paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines. I have talked about partitioning versus distributed cache in the second to last post. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be. So Oracle has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition. I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the data and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too. So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course. Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. 
The tree top must be common knowledge, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time. So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found. The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server&#39;s file group gets too big, it splits. One half of each key&#39;s partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning information on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there. If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically. If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines. So what happens to disk locality when file groups split? 
Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc. Now let us suppose the first partition, the even K&#39;s, splits. It could split so that multiples of 4 go one way and the rest another way. Now we&#39;d have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic; remember that these are controlled by the same server process. For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change. Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference. For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive. 
Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs. Implementing this on the current Virtuoso is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal. Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I just read <a href="http://labs.google.com/papers/bigtable.html" id="link-id10967a78">Google&#39;s Bigtable</a> paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines.</p>
<p>I have talked about partitioning versus distributed cache in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1229" id="link-id10913318">second to last post</a>. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be.</p>
<p>So <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1c4caaa0">Oracle</a> has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition.</p>
<p>I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the <a href="http://dbpedia.org/resource/Data" id="link-id0xa1b52ab8">data</a> and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too.</p>
<p>So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course.</p>
<p>Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. The tree top must be common <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x9f94a5f8">knowledge</a>, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time.</p>
<p>So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found.</p>
<p>The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server&#39;s file group gets too big, it splits. One half of each key&#39;s partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning <a href="http://dbpedia.org/resource/Information" id="link-id0x1a3e17a0">information</a> on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there.</p>
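<p>One way to picture the bookkeeping for such a split is the following Python sketch. It is only an illustration under stated assumptions: the partition map is modelled as a list of hash ranges with an owning file group, and a split hands the upper half of each range to the new file group. The post does not specify how the halves are chosen, and the names <code>split_file_group</code> and <code>owner_of</code> are hypothetical, not Virtuoso functions.</p>
<pre><code>
def split_file_group(ranges, group, new_group):
    """Return a new partition map with every range owned by group cut in half.

    ranges is a list of (low, high, owner) tuples, with high exclusive.
    The lower half of each affected range stays with group and the
    upper half is assigned to new_group (an assumed splitting rule).
    """
    result = []
    for low, high, owner in ranges:
        if owner != group:
            # Ranges of other file groups are not affected by the split.
            result.append((low, high, owner))
            continue
        mid = (low + high) // 2
        result.append((low, mid, group))
        result.append((mid, high, new_group))
    return result


def owner_of(ranges, hash_value):
    """Look up which file group holds a given hash value, or None."""
    for low, high, owner in ranges:
        if hash_value in range(low, high):
            return owner
    return None
</code></pre>
<p>In this model, the updated list is what every server would need to learn after a split; rows in the original file group that now map to the new one are the ones that can be removed in the background.</p>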
<p>If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically.  </p>
<p>If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines.</p>
<p>So what happens to disk locality when file groups split? Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc.  </p>
<p>Now let us suppose the first partition, the even K&#39;s, splits. It could split so that multiples of 4 go one way and the rest another way. Now we&#39;d have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic; remember that these are controlled by the same server process.</p>
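<p>The arithmetic of this example can be checked with a short Python sketch. It assumes that the partition is chosen from the 1K block number of the key (the key divided by 1024), which matches the ranges given above; the function names and the partition numbers 0, 1, and 2 are hypothetical and this is not Virtuoso code.</p>
<pre><code>
BLOCK = 1024  # 1K consecutive key values always stay together


def partition_before_split(key):
    """Two-way partitioning: even 1K blocks go to 0, odd blocks to 1."""
    return (key // BLOCK) % 2


def partition_after_split(key):
    """Partitioning after the even partition (0) has split.

    Blocks whose number is a multiple of 4 stay in partition 0, the
    remaining even blocks move to the new partition 2, and odd blocks
    are untouched.
    """
    block = key // BLOCK
    if block % 2 == 1:
        return 1  # odd blocks: 1K-2K, 3K-4K, ...
    if block % 4 == 0:
        return 0  # stays in place: 0-1K, 4K-5K, ...
    return 2      # moves to the new partition: 2K-3K, 6K-7K, ...
</code></pre>
<p>Consecutive keys within a 1K block map to the same partition both before and after the split, which is the property that preserves key compression.</p>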
<p>For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change.  </p>
<p>Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference.</p>
<p>For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive.</p>
<p>Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs.</p>
<p>Implementing this on the current <a href="http://virtuoso.openlinksw.com" id="link-id0x1a35c638">Virtuoso</a> is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal.</p>
<p>Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-08-27#1244">
  <rss:title>Virtuoso Cluster Preview</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-08-27T09:44:40Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I wrote the basics of the Virtuoso clustering support over the past three weeks. It can now manage connections, decide where things go, do two phase commits, insert and select data from tables partitioned over multiple Virtuoso instances. It works about enough to be measured, of which I will blog more over the next two weeks. I will in the following give a features preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007). Data Partitioning A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key. Blobs are stored on separate pages when they do not fit inline within the row. With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored. Virtuoso partitions by hash and allows specifying what parts of partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string. Like this, key compression gains are not lost by storing consecutive values on different partitions. Once the partitioning is specified, we specify which set of cluster nodes stores this index. Not every index has to be split evenly across all nodes. Also, all nodes do not have to have equal slices of the partitioned index, accommodating differences in capacity between cluster nodes. Each Virtuoso instance can manage up to 32TB of data. A cluster has no definite size limit. Load Balancing and Fault Tolerance When data is partitioned, an operation on the data goes where the data is. This provides a certain natural parallelism but we will discuss this further below. 
Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load. Some data, such as database schema, is replicated on all nodes. When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition. If this is the case, updates go to all nodes and reads go to a randomly picked node from the group. If one of the nodes in the group fails, operation can resume with the surviving node. The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster but these are aborts as in the case of deadlock and lose no committed data. Shared Nothing The Virtuoso architecture does not require a SAN for disk sharing across nodes. This is reasonable since a few disks on a local controller can easily provide 300MB/s of read and passing this over an interconnect fabric that would also have to carry inter-node messages could saturate even a fast network. Client View A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics. DDL operations like table creation and package installation are limited to one node, though. Applications such as ODS will run unmodified. They are installed on all nodes with a single install command. After this, the data partitioning must be declared, which is a one time operation to be done cluster by cluster. The only application change is specifying the partitioning columns for each index. The gain is optional redundant storage and capacity not limited to a single machine. The penalty is that single operations may take a little longer when not all data is managed by the same process but then the parallel throughput is increased. We note that the main ODS performance factor is web page logic and not database access. 
Â Thus splitting the web server logic over multiple nodes gives basically linear scaling. Parallel Query Execution Message latency is the principal performance factor in a clustered database.Â  Due to this, Virtuoso packs the maximum number of operations in a single message.Â  For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the join of the inner loop are run in parallel.Â  So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency.Â  If each step of the join gets 10 rows, for a total of 100000 results, the message latency is not a significant factorÂ and the cluster will clearly outperform a single node. Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node. Parallel SQL There are SQL extensions for stored procedures allowing parallelizing operations. Â For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned. Â Transactional semantics are kept but error detection is deferred to the actual execution. Transactions Each transaction is owned by one node of the cluster, the node to which the client is connected.Â  When more than one node besides the owner of the transaction is updated, two phase commit is used.Â  This is transparent to the application code.Â  No external transaction monitor is required, the Virtuoso instances perform these functions internally.Â  There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting information. 
Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common. Interconnect and Threading Virtuoso uses TCP to connect between instances. A single instance can have multiple listeners at different network interfaces for cluster activity. The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface. Long messages, such as transfers of blobs are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding. We will have to test the performance of TCP over Infiniband to see if there is clear gain in going to a lower level interface like MPI. The Virtuoso architecture is based on streams connecting cluster nodes point to point. The design does not per se gain from remote DMA or other features provided by MPI. Typically, messages are quite short, under 100K. Flow control for transfer of blobs is however nice to have but can be written at the application level if needed. We will get real data on the performance of different interconnects in the next weeks. Deployment and Management Configuring is quite simple, with each process sharing a copy of the same configuration file. One line in the file differs from host to host, telling it which one it is. Otherwise the database configuration files are individual per host, accommodating different file system layouts etc. Setting up a node requires copying the executable and two configuration files, no more. All functionality is contained in a single process. There are no installers to be run or such. 
Changing the number or network interface of cluster nodes requires a cluster restart. Changing data partitioning requires copying the data into a new table and renaming this over the old one. This is time consuming and does not mix well with updates. Splitting an existing cluster node requires no copying with repartitioning but shifting data between partitions does. A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes. Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical. Present State and Next Developments The basics are now in place. Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc. Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written. The RDF capabilities are not specifically affected by clustering except in a couple of places. Loading will be slightly revised to use larger batches of rows to minimize latency, for example. There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins etc. These will be added gradually. These are however not really central to the first application of RDF storage but are quite important for business intelligence, for example. We will run some benchmarks for comparing single host and clustered Virtuoso instances over the next weeks. Some of this will be with real data, giving an estimate on when we can move some of the RDF data we presently host to the new platform. We will benchmark against Oracle and DB2 later but first we get things to work and compare against ourselves. 
We roughly expect a halving in space consumption and a significant increase in single query performance and linearly scaling parallel throughput through addition of cluster nodes. The next update will be on this blog within two weeks.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
 <b><i>I wrote the basics of the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1229" id="link-id1383c310">Virtuoso clustering support</a> over the past three weeks. It can now manage connections, decide where things go, do two-phase commits, and insert and select <a href="http://dbpedia.org/resource/Data" id="link-id0xbbf5988">data</a> from tables partitioned over multiple <a href="http://virtuoso.openlinksw.com" id="link-id0x1da47d98">Virtuoso</a> instances. It works well enough to be measured, which I will <a href="http://dbpedia.org/resource/Blog" id="link-id0xabf4a10">blog</a> more about over the next two weeks.</i>
 </b>
</p>
<p>
 <b><i>In the following, I give a feature preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007).</i>
 </b>
</p>
<h3>Data Partitioning</h3>
<p>A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key. Blobs are stored on separate pages when they do not fit inline within the row. With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored. Virtuoso partitions by hash and allows specifying what parts of partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string. In this way, consecutive values stay on the same partition and key compression gains are not lost by scattering them over different partitions.</p>
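<p>As a rough illustration of partitioning on part of a column, here is a minimal Python sketch. It is not Virtuoso code: the function names are made up for this example, CRC32 is an arbitrary stand-in for whatever hash the server uses, and "bits 14-6" is read as the nine bits numbered 6 through 14, counting from 0 at the low end.</p>

```python
import zlib

def int_partition_key(value, low_bit=6, width=9):
    """Keep only bits low_bit .. low_bit+width-1 of a non-negative integer.

    With the defaults this selects bits 14-6, the example range from the text.
    """
    return (value // 2 ** low_bit) % 2 ** width

def str_partition_key(value, prefix_len=5):
    """Keep only the first prefix_len characters of a string."""
    return value[:prefix_len]

def partition_of(key, n_partitions):
    """Map a partition key to a partition number with a stable hash.

    CRC32 is an assumption made for this sketch, not the hash Virtuoso uses.
    """
    data = str(key).encode("utf-8")
    return zlib.crc32(data) % n_partitions

# The 64 consecutive integers 128..191 share the same bits 14-6,
# so the whole run maps to a single partition.
same_run = {partition_of(int_partition_key(i), 4) for i in range(128, 192)}
```

<p>Because the low six bits are ignored, each run of 64 neighbouring integer keys lands on one partition, which is the property that keeps key compression effective.</p>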
<p>Once the partitioning is specified, we specify which set of cluster nodes stores this index. Not every index has to be split evenly across all nodes. Also, the nodes do not all have to hold equal slices of the partitioned index, which accommodates differences in capacity between cluster nodes.</p>
<p>Each Virtuoso instance can manage up to 32TB of data. A cluster has no definite size limit.</p>
<h3>Load Balancing and Fault Tolerance</h3>
<p>When data is partitioned, an operation on the data goes where the data is. This provides a certain natural parallelism, which we discuss further below.</p>
<p>Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load. Some data, such as database schema, is replicated on all nodes. When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition. If this is the case, updates go to all nodes and reads go to a randomly picked node from the group.</p>
<p>If one of the nodes in the group fails, operation can resume with the surviving node. The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster, but these are aborts as in the case of deadlock and lose no committed data.</p>
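<p>The routing rule for a replicated partition can be sketched in a few lines of Python. This is a toy in-memory model with made-up names, not the server's implementation: each replica is just a dictionary, and there is no logging or recovery.</p>

```python
import random

class ReplicatedPartition:
    """One partition stored on several nodes (a toy, in-memory model)."""

    def __init__(self, node_names, rng=None):
        # Each replica is modelled as a plain dict keyed by node name.
        self.replicas = {name: {} for name in node_names}
        self.rng = rng or random.Random()

    def update(self, key, value):
        # Updates go to every node that holds the partition.
        for store in self.replicas.values():
            store[key] = value

    def read(self, key):
        # Reads go to one randomly picked node of the group.
        node = self.rng.choice(sorted(self.replicas))
        return node, self.replicas[node].get(key)

    def fail(self, node):
        # Drop a failed node; the surviving nodes keep serving.
        del self.replicas[node]

group = ReplicatedPartition(["node1", "node2"])
group.update("k", 1)
picked_node, picked_value = group.read("k")
group.fail("node1")
after_failure = group.read("k")
```

<p>Since every update was applied to all replicas, a read after <code>fail</code> still sees the committed value on the remaining node.</p>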
<h3>Shared Nothing</h3>
<p>The Virtuoso architecture does not require a SAN for disk sharing across nodes. This is reasonable since a few disks on a local controller can easily provide 300MB/s of read throughput, and passing this over an interconnect fabric that would also have to carry inter-node messages could saturate even a fast network. </p>
<h3>Client View</h3>
<p>A <a href="http://dbpedia.org/resource/SQL" id="link-id0x9fc302a0">SQL</a> or <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x19faa348">HTTP</a> client can connect to any node of the cluster and get an identical view of all data with full transactional semantics. DDL operations like table creation and package installation are limited to one node, though.</p>
<p>Applications such as <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x20cd9e98">ODS</a> will run unmodified. They are installed on all nodes with a single install command. After this, the data partitioning must be declared, which is a one-time operation to be done cluster by cluster. The only application change is specifying the partitioning columns for each index. The gain is optional redundant storage and capacity not limited to a single machine. The penalty is that single operations may take a little longer when not all data is managed by the same process, but in return the parallel throughput is increased. We note that the main ODS performance factor is web page logic and not database access. Thus splitting the web server logic over multiple nodes gives basically linear scaling.</p>
<h3>Parallel Query Execution</h3>
<p>Message latency is the principal performance factor in a clustered database. Due to this, Virtuoso packs the maximum number of operations in a single message. For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the inner-loop lookups are run in parallel. So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency. If each step of the join gets 10 rows, for a total of 100000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.</p>
<p>Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node.</p>
<h3>Parallel SQL</h3>
<p>There are SQL extensions for stored procedures that allow operations to be parallelized. For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned. Transactional semantics are kept but error detection is deferred to the actual execution.</p>
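<p>The buffering idea behind batched inserts can be sketched as follows. This is Python rather than Virtuoso's stored procedure language, and everything in it is an assumption made for illustration: the class name, the <code>send</code> callback standing in for the network, and the modulo rule standing in for the real partitioning.</p>

```python
from collections import defaultdict

class BatchingInserter:
    """Buffer rows per target node and ship them in batches."""

    def __init__(self, n_nodes, batch_size, send):
        self.n_nodes = n_nodes
        self.batch_size = batch_size
        self.send = send                  # callable(node, rows); hypothetical transport
        self.pending = defaultdict(list)  # node number to buffered rows

    def insert(self, key, row):
        node = key % self.n_nodes         # placeholder for the real partitioning
        self.pending[node].append(row)
        if len(self.pending[node]) == self.batch_size:
            self.flush(node)

    def flush(self, node):
        rows = self.pending.pop(node, [])
        if rows:
            self.send(node, rows)         # one message carries the whole batch

    def flush_all(self):
        # Send whatever is left, for example at the end of the loop.
        for node in list(self.pending):
            self.flush(node)

sent = []
inserter = BatchingInserter(2, 3, lambda node, rows: sent.append((node, rows)))
for i in range(8):
    inserter.insert(i, ("row", i))
inserter.flush_all()
```

<p>Eight inserts go out as four messages instead of eight. Any error on a node only shows up when its batch is actually sent, which mirrors the deferred error detection mentioned above.</p>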
<h3>Transactions</h3>
<p>Each transaction is owned by one node of the cluster, the node to which the client is connected. When more than one node besides the owner of the transaction is updated, two-phase commit is used. This is transparent to the application code. No external transaction monitor is required; the Virtuoso instances perform these functions internally. There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting <a href="http://dbpedia.org/resource/Information" id="link-id0xbcc0a50">information</a>.</p>
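<p>For readers unfamiliar with two-phase commit, here is the textbook protocol in a minimal Python sketch. It is not Virtuoso's implementation: the class and function names are invented, and logging, timeouts and the recovery cycle are all left out.</p>

```python
class Participant:
    """A node that votes in phase one and obeys the decision in phase two."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: promise to commit if asked, or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run by the node that owns the transaction. Returns True on commit."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    decision = all(votes)
    for p in participants:                       # phase 2: broadcast the outcome
        if decision:
            p.commit()
        else:
            p.rollback()
    return decision

good = [Participant("a"), Participant("b")]
good_outcome = two_phase_commit(good)
bad = [Participant("a"), Participant("b", can_commit=False)]
bad_outcome = two_phase_commit(bad)
```

<p>A single "no" vote in the first phase aborts the transaction on every participant, so the nodes never disagree about the outcome.</p>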
<p>Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common.</p>
<h3>Interconnect and Threading</h3>
<p>Virtuoso uses TCP to connect between instances. A single instance can have multiple listeners at different network interfaces for cluster activity. The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface. Long messages, such as transfers of blobs, are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding.</p>
<p>We will have to test the performance of TCP over <i>InfiniBand</i> to see if there is clear gain in going to a lower level interface like <i>MPI</i>. The Virtuoso architecture is based on streams connecting cluster nodes point to point. The design does not per se gain from remote DMA or other features provided by MPI. Typically, messages are quite short, under 100K. Flow control for transfer of blobs is nice to have, however, and can be written at the application level if needed. We will get real data on the performance of different interconnects in the coming weeks. </p>
<h3>Deployment and Management</h3>
<p>Configuring is quite simple, with each process sharing a copy of the same configuration file. One line in the file differs from host to host, telling it which one it is. Otherwise the database configuration files are individual per host, accommodating different file system layouts etc. Setting up a node requires copying the executable and two configuration files, no more. All functionality is contained in a single process. There are no installers or the like to run.</p>
<p>Changing the number or network interface of cluster nodes requires a cluster restart. Changing data partitioning requires copying the data into a new table and renaming this over the old one. This is time-consuming and does not mix well with updates. Splitting an existing cluster node requires no copying with repartitioning but shifting data between partitions does.</p>
<p>A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes.</p>
<p>Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.</p>
<h3>Present State and Next Developments</h3>
<p>The basics are now in place. Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc. Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written.</p>
<p>The <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xb919200">RDF</a> capabilities are not specifically affected by clustering except in a couple of places. Loading will be slightly revised to use larger batches of rows to minimize latency, for example.</p>
<p>There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins, etc. These will be added gradually. They are, however, not really central to the first application of RDF storage but are quite important for business intelligence, for example.</p>
<p>We will run some benchmarks for comparing single host and clustered Virtuoso instances over the coming weeks. Some of this will be with real data, giving an estimate of when we can move some of the RDF data we presently host to the new platform. We will benchmark against <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1ddf8288">Oracle</a> and <a href="http://dbpedia.org/resource/IBM_DB2" id="link-id0xa04b6ae8">DB2</a> later but first we get things to work and compare against ourselves.</p>
<p>We roughly expect a halving in space consumption and a significant increase in single query performance and linearly scaling parallel throughput through addition of cluster nodes.</p>
<p>
<i>The <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1246" id="link-id106de430">next update</a> will be on this blog within two weeks.</i>
</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-05-23#1196">
  <rss:title>Virtuoso Feature Update</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-05-23T14:04:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have a few new features that we did for the WWW 2007 conference that we will be shortly adding to the open source release. Optimization for SQL IN predicate. The IN predicate with a list of values will now use an index if available. This is useful for SPARQL queries with multiple FROM graphs, for example. API for index population estimates. There is an API for getting an approximate count of matches given one or more leading key parts of an index. Row-level autocommit mode – If one updates a huge table and the application does not require transaction isolation, it is possible to do this with an automatic commit after each row. This saves the server from having to keep rollback information on millions and billions of rows and saves it from temporary rollbacks of the uncommitted data for checkpoints etc. These things can completely hang a server if there are a few tens of millions of uncommitted inserts/deletes/updates. 64-bit IDs for IRIs and RDF objects, 64-bit integer data type. With the growth of some RDF databases to the tens of billions of triples, we run out of the 32-bit range for IDs of distinct IRIs. To accommodate this before actually running out, we introduce a longer ID. Some cost model adjustments. SQL extension for producing multiple result set rows from a single table row. This is useful for mapping SPARQL queries like SELECT * FROM graph WHERE {?s ?p ?o} into a UNION of SELECT *’s from multiple tables of different width. Each term of the UNION will simply produce multiple 3 column result rows for each actual row while not having to run through the tables multiple times. Together with this, we have also fixed a number of things with the relational-to-RDF mapping. We have been testing this extensively with the Musicbrainz mapping by Fred Giasson. These changes are small and to be released shortly. 
There are also some larger things in the works, to be released during this summer; the next post gives an overview of these.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have a few new features that we did for the <a href="http://www2007.org/" id="link-id12603130">WWW 2007</a> conference that we will be shortly adding to the open source release.</p>
<ul>
<li>Optimization for <a href="http://dbpedia.org/resource/SQL" id="link-id0x187b3e88">SQL</a> <code>IN</code> predicate. The IN predicate with a list of values will now use an index if available. This is useful for <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x19979788">SPARQL</a> queries with multiple <code>FROM</code> graphs, for example.</li> 
<li>
  <a href="http://docs.openlinksw.com/virtuoso/fn_key_estimate.html" id="link-idffb5400">API for index population estimates</a>. There is an API for getting an approximate count of matches given one or more leading key parts of an index.</li>
<li>
  <a href="http://docs.openlinksw.com/virtuoso/coredbengine.html#RowbyRowAutoCommit" id="link-id1097c420">Row-level autocommit mode</a> – If one updates a huge table and the application does not require transaction isolation, it is possible to do this with an automatic commit after each row. This saves the server from having to keep rollback <a href="http://dbpedia.org/resource/Information" id="link-id0x1876d260">information</a> on millions and billions of rows and saves it from temporary rollbacks of the uncommitted <a href="http://dbpedia.org/resource/Data" id="link-id0xd35cd28">data</a> for checkpoints etc. These things can completely hang a server if there are a few tens of millions of uncommitted inserts/deletes/updates.</li>
<li>64-bit IDs for IRIs and <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xe494d78">RDF</a> objects, 64-bit integer data type. With the growth of some RDF databases to the tens of billions of triples, we run out of the 32-bit range for IDs of distinct IRIs. To accommodate this before actually running out, we introduce a longer ID.</li>
<li>Some cost model adjustments.</li>
<li>SQL extension for producing multiple result set rows from a single table row. This is useful for mapping SPARQL queries like <code>SELECT * FROM graph WHERE {?s ?p ?o}</code> into a <code>UNION</code> of <code>SELECT *</code>’s from multiple tables of different width. Each term of the <code>UNION</code>  will simply produce multiple 3 column result rows for each actual row while not having to run through the tables multiple times. Together with this, we have also fixed a number of things with the relational-to-RDF mapping. We have been testing this extensively with the <a href="http://blog.musicbrainz.org/" id="link-id12109cb0">Musicbrainz</a> mapping by <a href="http://fgiasson.com/blog/" id="link-idffd52f0">Fred Giasson</a>. </li> </ul>
<p>These changes are small and to be released shortly.</p>
<p>There are also some larger things in the works, to be released during this summer; <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1197" id="link-id10cafde8">the next post</a> gives an overview of these.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-03-16#1159">
  <rss:title>Virtuoso Open Source 5.0 Release Imminent</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-03-16T09:47:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We are a couple of days from releasing the Virtuoso Open Source 5.0 cut. This will make the technology that we are showing with DBpedia and the various OpenLink web sites available to the public. The updates involve: Significant database engine improvements, as discussed in previous posts. Tons of RDF related bug fixes. Text index extension to SPARQL New SQL data type capturing the whole XML Schema scalar type system used in RDF. Soon to follow are: Basic inference for RDF, including type and property subsumption. Whole new disk IO system with much better disk locality. Existing databases will be automatically upgraded when started with the new Virtuoso 5.0 server. Note that after upgrade, the RDF data is not backward compatible. We will be rolling out more Virtuoso hosted semantic web content in the Linking Open Data project, part of our participation in the Semantic Web Education and Outreach activity at W3C.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We are a couple of days from releasing the <a href="http://virtuoso.openlinksw.com" id="link-id0x1d760208">Virtuoso</a> Open Source 5.0 cut. This will make the technology that we are showing with <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xe620cf0">DBpedia</a> and the various OpenLink web sites available to the public.</p>
<p>The updates involve:</p>
<ul>
<li>Significant database engine improvements, as discussed in <a id="link-id107cd7f0">previous</a> <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1131" id="link-idff99620">posts</a>.</li>
<li>Tons of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xea43080">RDF</a> related bug fixes.</li>
<li>Text index extension to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xe524328">SPARQL</a>
</li>
<li>New <a href="http://dbpedia.org/resource/SQL" id="link-id0x17ae2ad8">SQL</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x150e0e68">data</a> type capturing the whole XML Schema scalar type system used in RDF.</li>
</ul>
<p>Soon to follow are:</p>
<ul>
<li>Basic inference for RDF, including type and property subsumption.</li>
<li>Whole new disk IO system with much better disk locality.</li>
</ul>
<p>Existing databases will be automatically upgraded when started with the new Virtuoso 5.0 server. Note that after upgrade, the RDF data is not backward compatible.</p>
<p>We will be rolling out more Virtuoso hosted <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x15b13750">semantic web</a> content in the <a href="http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/" id="link-id12603750">Linking Open Data project</a>, part of our participation in the Semantic Web Education and Outreach activity at W3C.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-02-05#1131">
  <rss:title>Comparison of Open Source Databases with TPC D Queries</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-02-05T11:45:32Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Last time we talked about database engine and transactions. Now we have come to the realm of query processing in our revisiting of the DBMS side of Virtuoso. Now the well established, respectable standard benchmark for the basics of query processing is TPC D with its derivatives H and R. So we have, for testing how different SQL optimizers manage the 22 queries, run a mini version of the D queries with a 1% scale database, some 30M of data, all in memory. This basically catches whether SQL implementations miss some of the expected tricks and how efficient in memory loop and hash joins and aggregation are. When we get to our next stop, high volume I/O, we will run the same with D databases in the 10G ballpark. The databases were tested on the same machine, with warm cache, taking the best run of 3. All had full statistics and were running with read committed isolation, where applicable. The data was generated using the procedures from the Virtuoso test suite. The Virtuoso version tested was 5.0, to be released shortly. The MySQL was 5.0.27, the PostgreSQL was 8.1.6. Query Query Times in Milliseconds Virtuoso PostgreSQL MySQL MySQL with InnoDB Q1 206 763 312 198 Q2 4 6 3 3 Q3 13 51 254 64 Q4 4 16 24 60 Q5 15 22 64 68 Q6 9 70 189 65 Q7 52 143 211 84 Q8 29 31 13 11 Q9 36 114 97 61 Q10 32 51 117 57 Q11 16 9 12 10 Q12 8 21 18 130 Q13 18 74 - - Q14 7 21 418 1425 Q15 14 43 389 122 Q16 16 22 18 25 Q17 1 54 26 10 Q18 82 120 - - Q19 19 8 2 17 Q20 7 15 66 52 Q21 34 86 524 278 Q22 4 323 3311 805 Total (msec) 626 2063 6068 3545 We lead by a fair margin but MySQL is hampered by obviously getting some execution plans wrong and not doing Q13 and Q18 at all, at least not under several tens of seconds; so we left these out of the table in the interest of having comparable totals. As usual, we also ran the workload on Oracle 10g R2. 
Since Oracle does not like their numbers being published without explicit approval, we will just say that we are even with them within the parameters described above. Oracle has a more efficient decimal type so it wins where that is central, as on Q1. Also it seems to notice that the GROUP BYs of Q18 are produced in order of grouping columns, so it needs no intermediate table for storing the aggregates. If we addressed these matters, we&#39;d lead by some 15% whereas now we are even. A faster decimal arithmetic implementation may be in the release after next. In the next posts, we will look at IO and disk allocation, and also return to RDF and LUBM.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1116" id="link-id10598cc0">Last time</a> we talked about database engine and transactions. Now we have come to the realm of query processing in our revisiting of the DBMS side of <a href="http://virtuoso.openlinksw.com" id="link-id0x1a4279e8">Virtuoso</a>.</p>
<p>Now the well established, respectable standard benchmark for the basics of query processing is TPC D with its derivatives H and R. So we have, for testing how different <a href="http://dbpedia.org/resource/SQL" id="link-id0x17ce3a18">SQL</a> optimizers manage the 22 queries, run a mini version of the D queries with a 1% scale database, some 30M of <a href="http://dbpedia.org/resource/Data" id="link-id0x17370eb0">data</a>, all in memory. This basically catches whether SQL implementations miss some of the expected tricks and how efficient in-memory loop joins, hash joins, and aggregation are.</p>
<p>When we get to our next stop, high volume I/O, we will run the same with D databases in the 10G ballpark.</p>
<p>The databases were tested on the same machine, with warm cache, taking the best run of 3. All had full statistics and were running with read committed isolation, where applicable. The data was generated using the procedures from the Virtuoso test suite. The Virtuoso version tested was 5.0, to be released shortly. <a href="http://dbpedia.org/resource/MySQL" id="link-id0xe435ad8">MySQL</a> was version 5.0.27 and PostgreSQL was version 8.1.6. </p>
<table style="width: 334px; height: 556px; " border="1"> <tbody> 
<tr> <th rowspan="2">Query</th> <th colspan="4">Query Times in Milliseconds</th> </tr> 
<tr> <th> Virtuoso </th> <th> PostgreSQL </th> <th> MySQL </th> <th> MySQL with InnoDB </th> </tr> 
<tr> <td> Q1 </td> <td align="right"> <b>206</b> </td> <td align="right"> 763 </td> <td align="right"> 312 </td> <td align="right"> 198 </td> </tr> 
<tr> <td> Q2 </td> <td align="right"> 4 </td> <td align="right"> 6 </td> <td align="right"> <b>3</b> </td> <td align="right"> <b>3</b> </td> </tr> 
<tr> <td> Q3 </td> <td align="right"> <b>13</b> </td> <td align="right"> 51 </td> <td align="right"> 254 </td> <td align="right"> 64 </td> </tr> 
<tr> <td> Q4 </td> <td align="right"> <b>4</b> </td> <td align="right"> 16 </td> <td align="right"> 24 </td> <td align="right"> 60 </td> </tr> 
<tr> <td> Q5 </td> <td align="right"> <b>15</b> </td> <td align="right"> 22 </td> <td align="right"> 64 </td> <td align="right"> 68 </td> </tr> 
<tr> <td> Q6 </td> <td align="right"> <b>9</b> </td> <td align="right"> 70 </td> <td align="right"> 189 </td> <td align="right"> 65 </td> </tr> 
<tr> <td> Q7 </td> <td align="right"> <b>52</b> </td> <td align="right"> 143 </td> <td align="right"> 211 </td> <td align="right"> 84 </td> </tr> 
<tr> <td> Q8 </td> <td align="right"> 29 </td> <td align="right"> 31 </td> <td align="right"> 13 </td> <td align="right"> <b>11</b> </td> </tr> 
<tr> <td> Q9 </td> <td align="right"> <b>36</b> </td> <td align="right"> 114 </td> <td align="right"> 97 </td> <td align="right"> 61 </td> </tr> 
<tr> <td> Q10 </td> <td align="right"> <b>32</b> </td> <td align="right"> 51 </td> <td align="right"> 117 </td> <td align="right"> 57 </td> </tr> 
<tr> <td> Q11 </td> <td align="right"> 16 </td> <td align="right"> <b>9</b> </td> <td align="right"> 12 </td> <td align="right"> 10 </td> </tr> 
<tr> <td> Q12 </td> <td align="right"> <b>8</b> </td> <td align="right"> 21 </td> <td align="right"> 18 </td> <td align="right"> 130 </td> </tr> 
<tr> <td> Q13 </td> <td align="right"> <b>18</b> </td> <td align="right"> 74 </td> <td align="right"> - </td> <td align="right"> - </td> </tr> 
<tr> <td> Q14 </td> <td align="right"> <b>7</b> </td> <td align="right"> 21 </td> <td align="right"> 418 </td> <td align="right"> 1425 </td> </tr> 
<tr> <td> Q15 </td> <td align="right"> <b>14</b> </td> <td align="right"> 43 </td> <td align="right"> 389 </td> <td align="right"> 122 </td> </tr> 
<tr> <td> Q16 </td> <td align="right"> <b>16</b> </td> <td align="right"> 22 </td> <td align="right"> 18 </td> <td align="right"> 25 </td> </tr> 
<tr> <td> Q17 </td> <td align="right"> <b>1</b> </td> <td align="right"> 54 </td> <td align="right"> 26 </td> <td align="right"> 10 </td> </tr> 
<tr> <td> Q18 </td> <td align="right"> <b>82</b> </td> <td align="right"> 120 </td> <td align="right"> - </td> <td align="right"> - </td> </tr> 
<tr> <td> Q19 </td> <td align="right"> 19 </td> <td align="right"> 8 </td> <td align="right"> <b>2</b> </td> <td align="right"> 17 </td> </tr> 
<tr> <td> Q20 </td> <td align="right"> <b>7</b> </td> <td align="right"> 15 </td> <td align="right"> 66 </td> <td align="right"> 52 </td> </tr> 
<tr> <td> Q21 </td> <td align="right"> <b>34</b> </td> <td align="right"> 86 </td> <td align="right"> 524 </td> <td align="right"> 278 </td> </tr> 
<tr> <td> Q22 </td> <td align="right"> <b>4</b> </td> <td align="right"> 323 </td> <td align="right"> 3311</td> <td align="right"> 805 </td> </tr> 
<tr> <td>Total (msec)</td> <td align="right"><b>626</b></td> <td align="right">2063</td> <td align="right">6068</td> <td align="right">3545</td> </tr> </tbody> </table>
<p>We lead by a fair margin, but MySQL is hampered by evidently getting some execution plans wrong and by not completing Q13 and Q18 at all, at least not in under several tens of seconds, so we left these out of the table in the interest of having comparable totals.</p>
<p>As usual, we also ran the workload on <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0xe957b80">Oracle</a> 10g R2. Since Oracle does not like their numbers being published without explicit approval, we will just say that we are even with them within the parameters described above. Oracle has a more efficient decimal type so it wins where that is central, as on Q1. Also it seems to notice that the <code>GROUP BY</code>s of Q18 are produced in order of grouping columns, so it needs no intermediate table for storing the aggregates. If we addressed these matters, we&#39;d lead by some 15% whereas now we are even. A faster decimal arithmetic implementation may be in the release after next.</p> <p>In the next posts, we will look at IO and disk allocation, and also return to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xe074e40">RDF</a> and LUBM.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-01-10#1116">
  <rss:title>Virtuoso 5.0 Preview</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-01-10T15:08:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">As previously said, we have a Virtuoso with brand new engine multithreading. It is now complete and passes its regular test suite. This is the basis for Virtuoso 5.0, to be available as the open source and commercial cuts as before. As one benchmark, we used the TPC-C test driver that has always been bundled with Virtuoso. We ran 100000 new orders worth of the TPC-C transaction mix first with one client and then with 4 clients, each client going to its own warehouse, so there was not much lock contention. We did this on a 4-core Intel machine, with the working set in RAM. With the old engine, 1 client took 1m43 and 4 clients took 3m47. With the new engine, 1 client took 1m30 and 4 clients took 2m37. So, 400000 new orders in 2m37, for 152820 new orders per minute as opposed to 105720 per minute previously. Do not confuse this with the official tpmC metric, which involves a whole bunch of further rules. TPC-C has activity spread over a few different tables. With tests dealing with fewer tables, improvements in parallelism are far greater. Aside from better parallelism, we have other features. One of them is a change in the read committed isolation, so that we now return the previous committed state for uncommitted changed rows instead of waiting for the updating transaction to terminate. This is similar to what Oracle does for read committed. Also we now do log checkpoints without having to abort pending write transactions. When we have faster inserts, we actually see the RDF bulk loader run slower. This is really backwards. The reason is that while one thread parses, other threads insert; if the inserting threads finish first, they go to wait on a semaphore, and this whole business of context switching absolutely kills performance. With slower inserts, the parser keeps ahead so there is less context switching, hence better overall throughput. 
I still do not understand how the OS can spend between 1.5 and 6 microseconds, several thousand instructions, deciding what to do next when there are only 3-4 eligible threads and everything else is background activity that gets by on a few dozen time slices per second. Solaris is a little better than Linux at this but not dramatically so. Mac OS X is way worse. As said, we use Oracle 10g R2 on the same platform (Linux FC5 64 bit) for sparring. It is really a very good piece of software. We have written the TPC-C transactions in PL/SQL. What is surprising is that these procedures run amazingly slowly, even with a single client. Otherwise the Oracle engine is very fast. Well, as I recall, the official TPC-C runs with Oracle use an OCI client and no stored procedures. Strange. While Virtuoso, for example, fills the initial TPC-C state a little faster than Oracle, the procedures run 5-10 times slower with Oracle than with Virtuoso, with all data in warm cache and a single client. While some parts of Oracle are really well optimized, such as all the basic joins and aggregates, we are surprised at how they could have neglected such a central piece as the procedural language. Also, we have looked at transaction semantics. Serializable is mostly serializable with Oracle but does not always keep a steady count. Also it does not prevent inserts into a space that has been found empty by a serializable transaction. True, it will not show these inserts to the serializable transaction, so in this it follows the rules. Also, to make a read really repeatable, it seems that the read has to be FOR UPDATE. Otherwise one cannot implement a reliable resource transaction, like changing the balance of an account. Anyway, the Virtuoso engine overhaul is now mostly complete. This is of course an open-ended topic but the present batch is nearing completion. We have gone through as many as 3 implementations of hash joins, and some things have yet to be finished there. Oracle has very good hash joins. 
The only way we could match that was to do it all in memory, dropping any persistent storage of the hash. This is of course OK if the hash is not very large, and anyway hash joins go sour if the hash does not fit in the working set. As next topics, we have more RDF and the LUBM benchmark to finish. Also we should revisit TPC-D. Databases are really quite complicated and extensive pieces of software. Much more so than the casual observer might think.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>As <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1108" id="link-id10c66e68">previously said</a>, we have a <a href="http://virtuoso.openlinksw.com" id="link-id0x1a5caeb8">Virtuoso</a> with brand new engine multithreading. It is now complete and passes its regular test suite. This is the basis for Virtuoso 5.0, to be available as the open source and commercial cuts as before.</p>
<p>As one benchmark, we used the <a href="http://dbpedia.org/resource/TPC-C" id="link-id0x15f8cbd8">TPC-C</a> test driver that has always been bundled with Virtuoso. We ran 100000 new orders worth of the TPC-C transaction mix first with one client and then with 4 clients, each client going to its own warehouse, so there was not much lock contention. We did this on a 4-core Intel machine, with the working set in RAM. With the old engine, 1 client took 1m43 and 4 clients took 3m47. With the new engine, 1 client took 1m30 and 4 clients took 2m37. So, 400000 new orders in 2m37, for 152820 new orders per minute as opposed to 105720 per minute previously. Do not confuse this with the official tpmC metric, which involves a whole bunch of further rules.</p>
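<p>The throughput arithmetic can be re-derived from the quoted run times. The sketch below is only an illustration: it assumes the times are rounded to whole seconds, which would explain why it lands within a fraction of a percent of the figures given above rather than exactly on them. The function name <code>per_minute</code> is ours, not part of the test driver.</p>
<pre><code>def per_minute(new_orders, minutes, seconds):
    """New orders per minute for a run of the given wall-clock length."""
    elapsed_seconds = minutes * 60 + seconds
    return new_orders * 60 / elapsed_seconds

# 4 clients x 100000 new orders each, run times as quoted in the post.
TOTAL_NEW_ORDERS = 4 * 100000
old_rate = per_minute(TOTAL_NEW_ORDERS, 3, 47)   # old engine, 4 clients
new_rate = per_minute(TOTAL_NEW_ORDERS, 2, 37)   # new engine, 4 clients

# Rates per minute and the ratio between them.
print(round(old_rate), round(new_rate), round(new_rate / old_rate, 2))
</code></pre>
<p>With whole-second times this gives a ratio of roughly 1.45 between the new and the old engine on 4 clients.</p>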
<p>TPC-C has activity spread over a few different tables. With tests dealing with fewer tables, improvements in parallelism are far greater.</p>
<p>Aside from better parallelism, we have other features. One of them is a change in the read committed isolation, so that we now return the previous committed state for uncommitted changed rows instead of waiting for the updating transaction to terminate. This is similar to what <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x18184c08">Oracle</a> does for read committed. Also we now do log checkpoints without having to abort pending write transactions.</p>
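<p>As a toy illustration of the read committed change, the sketch below keeps the last committed value next to a pending uncommitted one and lets other readers see the committed value without waiting. This is not the engine code; the <code>Row</code> class and its methods are hypothetical, and the rule that a transaction sees its own pending write is our assumption.</p>
<pre><code>class Row:
    """Toy row: the last committed value plus at most one pending write."""

    def __init__(self, value):
        self.committed = value
        self.pending = None     # uncommitted new value, if any
        self.writer = None      # id of the transaction holding the write

    def write(self, txn_id, value):
        self.pending, self.writer = value, txn_id

    def commit(self, txn_id):
        if self.writer == txn_id:
            self.committed = self.pending
            self.pending = self.writer = None

    def read_committed(self, txn_id):
        # The writer sees its own change (an assumption of this sketch);
        # everyone else gets the previous committed state at once instead
        # of waiting for the updating transaction to terminate.
        if self.writer == txn_id:
            return self.pending
        return self.committed


row = Row(100)
row.write("T1", 150)
before = row.read_committed("T2")   # T1 has not committed yet
row.commit("T1")
after = row.read_committed("T2")
print(before, after)
</code></pre>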
<p>When we have faster inserts, we actually see the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xde6fca0">RDF</a> bulk loader run slower. This is really backwards. The reason is that while one thread parses, other threads insert; if the inserting threads finish first, they go to wait on a semaphore, and this whole business of context switching absolutely kills performance. With slower inserts, the parser keeps ahead so there is less context switching, hence better overall throughput. I still do not understand how the OS can spend between 1.5 and 6 microseconds, several thousand instructions, deciding what to do next when there are only 3-4 eligible threads and everything else is background activity that gets by on a few dozen time slices per second. Solaris is a little better than Linux at this but not dramatically so. Mac OS X is way worse.</p>
<p>As said, we use Oracle 10g R2 on the same platform (Linux FC5 64 bit) for sparring. It is really a very good piece of software. We have written the TPC-C transactions in PL/<a href="http://dbpedia.org/resource/SQL" id="link-id0x15b33600">SQL</a>. What is surprising is that these procedures run amazingly slowly, even with a single client. Otherwise the Oracle engine is very fast. Well, as I recall, the official TPC-C runs with Oracle use an OCI client and no stored procedures. Strange. While Virtuoso, for example, fills the initial TPC-C state a little faster than Oracle, the procedures run 5-10 times slower with Oracle than with Virtuoso, with all <a href="http://dbpedia.org/resource/Data" id="link-id0xd9d1150">data</a> in warm cache and a single client. While some parts of Oracle are really well optimized, such as all the basic joins and aggregates, we are surprised at how they could have neglected such a central piece as the procedural language.</p>
<p>Also, we have looked at transaction semantics. Serializable is mostly serializable with Oracle but does not always keep a steady count. Also it does not prevent inserts into a space that has been found empty by a serializable transaction. True, it will not show these inserts to the serializable transaction, so in this it follows the rules. Also, to make a read really repeatable, it seems that the read has to be FOR UPDATE. Otherwise one cannot implement a reliable resource transaction, like changing the balance of an account.</p>
<p>Anyway, the Virtuoso engine overhaul is now mostly complete. This is of course an open-ended topic but the present batch is nearing completion. We have gone through as many as 3 implementations of hash joins, and some things have yet to be finished there. Oracle has very good hash joins. The only way we could match that was to do it all in memory, dropping any persistent storage of the hash. This is of course OK if the hash is not very large, and anyway hash joins go sour if the hash does not fit in the working set.</p>
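<p>For readers who have not met the technique, an all-in-memory hash join comes down to a build phase and a probe phase. The sketch below is a textbook version under our own assumptions, not the Virtuoso implementation; the function name, the dict-per-row representation and the sample tables are made up for the example.</p>
<pre><code>from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dicts with a purely in-memory hash table."""
    # Build phase: hash the (ideally smaller) input; nothing is persisted.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: one hash lookup per row of the other input.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


warehouses = [{"w_id": 1, "w_name": "north"}, {"w_id": 2, "w_name": "south"}]
orders = [{"o_id": 10, "o_w_id": 1}, {"o_id": 11, "o_w_id": 2},
          {"o_id": 12, "o_w_id": 1}]
result = hash_join(warehouses, orders, "w_id", "o_w_id")
print([(r["o_id"], r["w_name"]) for r in result])
</code></pre>
<p>The whole hash table lives in process memory here, which is exactly why the approach only pays off while the build side fits in the working set.</p>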
<p>As next topics, we have more RDF and the LUBM benchmark to finish. Also we should revisit TPC-D.</p>
<p>Databases are really quite complicated and extensive pieces of software. Much more so than the casual observer might think.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2006-12-22#1108">
  <rss:title>Season&#39;s Greetings from Virtuoso Development</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-12-22T10:03:23Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It&#39;s been a long and very busy time since the last blog post. Now and then, circumstances call for a return to the contemplation of first principles. I have lately beheld the Platonic ideal of database-ness and translated it into engineering elegance. No quest is static and no objective is permanently achieved. Accordingly, I have redone all Virtuoso core engine structures for control of parallel execution. As we now routinely get multiple cores per chip, this is more important than before. Aside from dramatic improvements in multiprocessor performance, there is also quite a bit of optimization for basic relational operations. Of course, this is not for the pure pleasure of geek-craft; it serves a very practical purpose. RDF opens a new database frontier, where these things make a significant difference. In application scenarios involving either federated/virtual database or running typical web applications, the core concurrency of the DBMS is not really the determining factor. However, with RDF, we get a small number of very large tables and most processing goes to these tables. This is also often so with business intelligence but it is still more so with RDF. Thus the parallelism within a single index becomes essential. We have also made a point by point comparison of Virtuoso and Oracle 10g for basic relational operations. Oracle is very good, certainly in the basic relational operations like table scans and different kinds of joins. As a matter of principle, we will at the minimum match Oracle in all these things, in single and multiprocessor environments. The Virtuoso cut forthcoming in January will have all this inside. We are also considering making and publishing a basic RDBMS performance checklist, aimed at comparing specific aspects of relational engine performance. While the TPC tests give a good aggregate figure, it is sometimes interesting to look at a finer level of detail. 
We may not be allowed to give out numbers in all cases due to license terms but we can certainly make the test available and publish numbers for those who do not object to this. Of course, RDF is the direct beneficiary of all these efforts, since RDF loading and querying basically rests on the performance of very relational things, such as diverse types of indices and joins. More information will be forthcoming in January. Merry Christmas and productive new year to all.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>It&#39;s been a long and very busy time since <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1085" id="link-id104d7ac0">the last blog post</a>.</p>
<p>Now and then, circumstances call for a return to the contemplation of first principles. I have lately beheld the Platonic ideal of database-ness and translated it into engineering elegance. No quest is static and no objective is permanently achieved.</p>
<p>Accordingly, I have redone all <a href="http://virtuoso.openlinksw.com" id="link-id0xe3fc228">Virtuoso</a> core engine structures for control of parallel execution. As we now routinely get multiple cores per chip, this is more important than before. Aside from dramatic improvements in multiprocessor performance, there is also quite a bit of optimization for basic relational operations.</p>
<p>Of course, this is not for the pure pleasure of geek-craft; it serves a very practical purpose. <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x159636a8">RDF</a> opens a new database frontier, where these things make a significant difference. In application scenarios involving either federated/<a href="http://dbpedia.org/resource/Virtual_Database" id="link-id0xdb8dcf8">virtual database</a> or running typical web applications, the core concurrency of the DBMS is not really the determining factor. However, with RDF, we get a small number of very large tables and most processing goes to these tables. This is also often so with business intelligence but it is still more so with RDF. Thus the parallelism within a single index becomes essential.</p>
<p>We have also made a point by point comparison of Virtuoso and <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0xe68d290">Oracle</a> 10g for basic relational operations. Oracle is very good, certainly in the basic relational operations like table scans and different kinds of joins. As a matter of principle, we will at the minimum match Oracle in all these things, in single and multiprocessor environments. The Virtuoso cut forthcoming in January will have all this inside. We are also considering making and publishing a basic <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x9ec95c60">RDBMS</a> performance checklist, aimed at comparing specific aspects of relational engine performance. While the TPC tests give a good aggregate figure, it is sometimes interesting to look at a finer level of detail. We may not be allowed to give out numbers in all cases due to license terms but we can certainly make the test available and publish numbers for those who do not object to this.</p>
<p>Of course, RDF is the direct beneficiary of all these efforts, since RDF loading and querying basically rests on the performance of very relational things, such as diverse types of indices and joins.</p>
<p> More <a href="http://dbpedia.org/resource/Information" id="link-id0xe67bdc8">information</a> will be forthcoming in January.</p>
<p>Merry Christmas and productive new year to all.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2006-11-01#1074">
  <rss:title>More RDF scalability tests</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-11-01T19:26:40Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have lately been busy with RDF scalability. We work with the 8000 university LUBM data set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h. The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See Advances in Virtuoso RDF Triple Storage for an in-depth treatment of the subject. The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we&#39;ll look at dropping a couple of critical sections from index random access. Also some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still we should not waste too much time on platform specifics. We&#39;ll see. We just updated the Virtuoso Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly. We are also now applying the relational to RDF mapping discussed in Declarative SQL Schema to RDF Ontology Mapping to the ODS applications. There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future. We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the current cost model. A more systematic review of the cost model for SPARQL will come when we get to the queries. 
We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the SQL engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have lately been busy with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x17524ab8">RDF</a> scalability. We work with the 8000 university LUBM <a href="http://dbpedia.org/resource/Data" id="link-id0xd4ba910">data</a> set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.</p>
<p>The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSBitmapIndexing" id="link-id105e5cf8">Advances in Virtuoso RDF Triple Storage</a> for an in-depth treatment of the subject.</p>
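<p>As a rough cross-check of these figures, the database size and the bytes per triple imply a triple count, and the load time then implies an average load rate. The sketch below reads 75G as decimal gigabytes, which is our assumption; the constant names are ours as well.</p>
<pre><code>DB_BYTES = 75 * 10**9               # "75G", taken as decimal gigabytes (assumption)
BYTES_PER_TRIPLE = 74
LOAD_SECONDS = 23 * 3600 + 46 * 60  # 23h 46m

implied_triples = DB_BYTES / BYTES_PER_TRIPLE
triples_per_second = implied_triples / LOAD_SECONDS

print(f"{implied_triples / 1e9:.2f} billion triples")
print(f"{triples_per_second:.0f} triples per second")
</code></pre>
<p>This comes out at a little over a billion triples, consistent with the data set size above, and an average of roughly 12,000 triples per second over the whole load.</p>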
<p>The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we&#39;ll look at dropping a couple of critical sections from index random access. Also some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still we should not waste too much time on platform specifics. We&#39;ll see.</p>
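<p>The spin count idea is to retry a busy mutex a bounded number of times before letting the scheduler put the thread to sleep. The sketch below shows only the control flow, in user space, with Python&#39;s <code>threading.Lock</code>; it is not a kernel parameter and not the engine code, and the helper name and default spin count are made up. Python itself would show no speedup from this, so treat it as pseudocode that happens to run.</p>
<pre><code>import threading

def acquire_with_spin(lock, spin_count=1000):
    """Try the lock spin_count times without blocking, then block.

    Returns True if the lock was obtained while spinning, False if the
    caller had to fall back to a blocking acquire (and possibly sleep).
    """
    for _ in range(spin_count):
        if lock.acquire(blocking=False):
            return True
    lock.acquire()      # give up spinning and wait in the usual way
    return False


index_mutex = threading.Lock()
got_it_spinning = acquire_with_spin(index_mutex)
index_mutex.release()
print(got_it_spinning)
</code></pre>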
<p>We just updated the <a href="http://virtuoso.openlinksw.com" id="link-id0x189d64b8">Virtuoso</a> Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.</p>
<p>We are also now applying the relational to RDF mapping discussed in <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSSQLRDF" id="link-id10677bb8">Declarative SQL Schema to RDF Ontology Mapping</a> to the <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0xa0f5fde0">ODS</a> applications.</p>
<p>There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.</p>
<p>We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the current cost model. A more systematic review of the cost model for <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x19b96630">SPARQL</a> will come when we get to the queries.</p>
<p>We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the <a href="http://dbpedia.org/resource/SQL" id="link-id0x19bbd098">SQL</a> engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.</p>
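<p>To make the alternative to forward chaining concrete, the sketch below answers a class membership query as a union over the class and its subclasses at query time, so no entailed <code>rdf:type</code> triples are stored. It is a toy over an in-memory set of triples, not the SQL engine; the <code>ex:</code> names and both helper functions are hypothetical.</p>
<pre><code>TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Only asserted triples are stored; nothing is forward chained.
triples = {
    ("ex:FullProfessor", SUBCLASS, "ex:Professor"),
    ("ex:Professor", SUBCLASS, "ex:Faculty"),
    ("ex:alice", TYPE, "ex:FullProfessor"),
    ("ex:bob", TYPE, "ex:Faculty"),
}

def subclasses_of(cls):
    """cls plus everything transitively declared a subclass of it."""
    found, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p == SUBCLASS and o == current and s not in found:
                found.add(s)
                todo.append(s)
    return found

def instances_of(cls):
    """Union of the explicit rdf:type matches for cls and each subclass."""
    wanted = subclasses_of(cls)
    return {s for s, p, o in triples if p == TYPE and o in wanted}

print(sorted(instances_of("ex:Faculty")))
</code></pre>
<p>Here <code>ex:alice</code> is found as faculty although the triple saying so was never materialized, which is the working set saving the paragraph above refers to.</p>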
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2006-07-18#1010">
  <rss:title>Intermediate RDF Loading Results</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-07-18T11:28:16Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Following from the post on a new multithreaded RDF loader, here are some intermediate results and action plans based on these. The experiments were made on a dual 1.6GHz Sun SPARC with 4G RAM and 2 SCSI disks. The data sets were the 48M triple Wikipedia data set and the 1.9M triple Wordnet data set. 100% CPU means one CPU constantly active. 100% disk means one thread blocked on the read system call at all times. Starting with an empty database, loading the Wikipedia set took 315 minutes, amounting to about 2500 triples per second. After this, loading the Wordnet data set with cold cache and 48M triples already in the table took 4 minutes 12 seconds, amounting to 6838 triples per second. Loading the Wikipedia data had CPU usage up to 180% but over the whole run CPU usage was around 50% with disk I/O around 170%. Loading the larger data set was significantly I/O bound while loading the smaller set was more CPU bound, yet was not at full 200% CPU. The RDF quad table was indexed on GSPO and PGOS. As one would expect, the bulk of I/O was on the PGOS index. We note that the pages of this index were on the average only 60% full. Thus the most relevant optimization seems to be to fill the pages closer to 90%. This will directly cut about a third of all I/O plus will have an additional windfall benefit in the form of better disk cache hit rates resulting from a smaller database. The most practical way of having full index pages in the case of unpredictable random insert order will be to take sets of adjacent index leaf pages and compact the rows so that the last page of the set goes empty. Since this is basically an I/O optimization, this should be done when preparing to write the pages to disk, hence concerning mostly old dirty pages. Insert and update times will not be affected since these operations will not concern themselves with compaction. 
Thus the CPU cost of background compaction will be negligible in comparison with writing the pages to disk. Naturally this will benefit any relational application as well as free text indexing. RDF and free text will be the largest beneficiaries due to the large numbers of short rows inserted in random order. Looking at the CPU usage of the tests, locating the place in the index where to insert, which by rights should be the bulk of the time cost, was not very significant, only about 15%. Thus there are many unused possibilities for optimization, for example rewriting in C some parts of the loader currently done as stored procedures. Also the thread usage of the loader, with one thread parsing and mapping IRI strings to IRI IDs and 6 threads sharing the inserting, could be refined for better balance, as we have noted that the parser thread sometimes forms a bottleneck. Doing the updating of the IRI name to IRI ID mapping on the insert thread pool would produce some benefit. Anyway, since the most important test was I/O bound, we will first implement some background index compaction and then revisit the experiment. We expect to be able to double the throughput of the Wikipedia data set loading.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Following from <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1000" id="link-id105e5f28">the post on a new multithreaded RDF loader</a>, here are some intermediate results and action plans based on these.</p>
<p>The experiments were made on a dual 1.6GHz Sun SPARC with 4G RAM and 2 SCSI disks. The <a href="http://dbpedia.org/resource/Data" id="link-id0x189d5d20">data</a> sets were the 48M triple Wikipedia data set and the 1.9M triple Wordnet data set. 100% CPU means one CPU constantly active. 100% disk means one thread blocked on the read system call at all times.</p>
<p>Starting with an empty database, loading the Wikipedia set took 315 minutes, amounting to about 2500 triples per second. After this, loading the Wordnet data set with cold cache and 48M triples already in the table took 4 minutes 12 seconds, amounting to 6838 triples per second. Loading the Wikipedia data had CPU usage up to 180% but over the whole run CPU usage was around 50% with disk I/O around 170%. Loading the larger data set was significantly I/O bound while loading the smaller set was more CPU bound, yet was not at full 200% CPU.</p>
<p>The <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x9dc76d50">RDF</a> quad table was indexed on GSPO and PGOS. As one would expect, the bulk of I/O was on the PGOS index. We note that the pages of this index were on the average only 60% full. Thus the most relevant optimization seems to be to fill the pages closer to 90%. This will directly cut about a third of all I/O plus will have an additional windfall benefit in the form of better disk cache hit rates resulting from a smaller database.</p>
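<p>The &quot;about a third&quot; follows from the number of pages being inversely proportional to how full each page is. A minimal worked version, with a function name of our own choosing:</p>
<pre><code>def io_saving(fill_before, fill_after):
    """Fraction of pages (and hence page I/O) saved by a higher fill factor."""
    # Pages needed for a fixed amount of data are proportional to 1 / fill.
    return 1 - fill_before / fill_after

print(f"{io_saving(0.60, 0.90):.0%}")
</code></pre>
<p>Going from 60% to 90% full means the same rows need 60/90 of the pages, so one page in three disappears, before counting any gain in cache hit rate.</p>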
<p>The most practical way of having full index pages in the case of unpredictable random insert order will be to take sets of adjacent index leaf pages and compact the rows so that the last page of the set goes empty. Since this is basically an I/O optimization, this should be done when preparing to write the pages to disk, hence concerning mostly old dirty pages. Insert and update times will not be affected since these operations will not concern themselves with compaction. Thus the CPU cost of background compaction will be negligible in comparison with writing the pages to disk. Naturally this will benefit any relational application as well as free text indexing. RDF and free text will be the largest beneficiaries due to the large numbers of short rows inserted in random order.</p>
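<p>The compaction step described above can be sketched as repacking the rows of a run of adjacent leaf pages, in key order, so that the pages at the end of the run come out empty. This is a simplified model under our own assumptions (pages as lists of rows, a fixed row capacity, a 90% target), not the planned implementation.</p>
<pre><code>def compact(pages, capacity, target_fill=0.9):
    """Repack rows of adjacent leaf pages, in key order, up to target_fill.

    pages is a list of lists of rows. The result has the same number of
    pages, with any freed pages left empty at the end of the run.
    """
    rows_per_page = max(1, int(capacity * target_fill))
    rows = [row for page in pages for row in page]      # order is preserved
    packed = [rows[i:i + rows_per_page]
              for i in range(0, len(rows), rows_per_page)]
    if len(packed) > len(pages):
        return pages        # already fuller than the target; leave as is
    return packed + [[] for _ in range(len(pages) - len(packed))]


# Three adjacent pages of capacity 10, each 60% full.
pages = [list(range(0, 6)), list(range(6, 12)), list(range(12, 18))]
print([len(p) for p in compact(pages, capacity=10)])
</code></pre>
<p>With three pages at 60% the rows fit on two pages at 90% and the third goes empty, so it never needs to be written.</p>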
<p>Looking at the CPU usage of the tests, locating the place in the index where to insert, which by rights should be the bulk of the time cost, was not very significant, only about 15%. Thus there are many unused possibilities for optimization, for example rewriting in C some parts of the loader currently done as stored procedures. Also the thread usage of the loader, with one thread parsing and mapping IRI strings to IRI IDs and 6 threads sharing the inserting, could be refined for better balance, as we have noted that the parser thread sometimes forms a bottleneck. Doing the updating of the IRI name to IRI ID mapping on the insert thread pool would produce some benefit.</p>
<p>Anyway, since the most important test was I/O bound, we will first implement some background index compaction and then revisit the experiment. We expect to be able to double the throughput of the Wikipedia data set loading.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2006-07-17#1007">
  <rss:title>More Thoughts on ORDBMS Clients, .NET and RDF</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-07-17T11:47:30Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Continuing on from the previous post... If Microsoft opens the right interfaces for independent developers, we see many exciting possibilities for using ADO.NET 3.0 with Virtuoso. Microsoft quite explicitly states that their thrust is to decouple the client side representation of data as .NET objects from the relational schema on the database. This is a worthy goal. But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: towards object-oriented databases (OODBMS) and towards making applications for the semantic web. In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going, as it were, in the opposite direction from Microsoft&#39;s intended decoupling. For example, we could do typical OODBMS tricks such as pre-fetching objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together, and if all objects are modeled as subclasses (sub-tables) of a common superclass, then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in highly navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback. What is more interesting and more topical at present is making clients for the RDF world. There, the OWL ontology could be used to generate the .NET classes, and the DBMS could, when returning URIs that serve as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as &quot;proxies&quot; of these RDF objects. 
Of course, only predicates for which the client has a representation are relevant, so some client-server handshake is needed at the start. The data that could be pre-fetched is roughly the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type is not known or inferable could be left out or represented as a special class with name-value pairs for its attributes; the same goes for blank nodes. In this way, .NET&#39;s considerable UI capabilities could directly be exploited for visualizing RDF data, provided that the data complies reasonably well with a known ontology. If a SPARQL query returned a result-set, IRI-typed columns would be returned as .NET instances and the server would pre-fetch enough data to fill them in. For a CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an Entity SQL string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL-type queries, though. Many of these questions will be better answerable as we get more details on Microsoft&#39;s forthcoming ADO.NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Continuing on from <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1002" id="link-id1064f0c8">the previous post</a>... If Microsoft opens the right interfaces for independent developers, we see many exciting possibilities for using <a href="http://msdn2.microsoft.com/en-us/data/aa937699.aspx" id="link-id10f3ab60">ADO.NET</a> 3.0 with <a href="http://virtuoso.openlinksw.com" id="link-id0x98d60b0">Virtuoso</a>.</p>
<p>Microsoft quite explicitly states that their thrust is to decouple the client side representation of <a href="http://dbpedia.org/resource/Data" id="link-id0x175112a8">data</a> as .NET objects from the relational schema on the database. This is a worthy goal.</p>
<p>But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: towards object-oriented databases (OODBMS) and towards making applications for the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0xdbba5b0">semantic web</a>.</p>
<p>In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going, as it were, in the opposite direction from Microsoft&#39;s intended decoupling. For example, we could do typical OODBMS tricks such as pre-fetching objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together, and if all objects are modeled as subclasses (sub-tables) of a common superclass, then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in highly navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.</p>
<p>What is more interesting and more topical at present is making clients for the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xe2c1e68">RDF</a> world. There, the OWL ontology could be used to generate the .NET classes, and the DBMS could, when returning URIs that serve as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as &quot;proxies&quot; of these RDF objects. Of course, only predicates for which the client has a representation are relevant, so some client-server handshake is needed at the start. The data that could be pre-fetched is roughly the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type is not known or inferable could be left out or represented as a special class with name-value pairs for its attributes; the same goes for blank nodes.</p>
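<p>As a rough illustration of this mapping, here is a minimal sketch in Python rather than .NET. The names Proxy, materialize, and known are hypothetical, as is the example.org data. The known table stands in for the result of the client-server handshake, and treating any object that is also a fetched subject as a pointer is a simplifying assumption, not a description of how Virtuoso or ADO.NET would do it.</p>
<pre>
```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Proxy:
    """Client-side stand-in for one RDF subject (hypothetical class)."""

    def __init__(self, iri, rdf_type=None):
        self.iri = iri
        # None means the type is unknown to the client: a generic name-value bag.
        self.rdf_type = rdf_type
        # predicate IRI mapped to a list, so multi-valued predicates need no special case
        self.values = defaultdict(list)

    def __repr__(self):
        return f"Proxy({self.iri!r})"


def materialize(triples, known):
    """Build proxies from (subject, predicate, object) triples.

    `known` maps an rdf:type IRI to the set of predicates the client has
    fields for; it plays the role of the handshake result.
    """
    triples = list(triples)
    types = {s: o for s, p, o in triples if p == RDF_TYPE}
    proxies = {}
    for s, _, _ in triples:
        declared = types.get(s)
        proxies[s] = Proxy(s, declared if declared in known else None)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        proxy = proxies[s]
        # Typed proxies keep only declared predicates; untyped ones keep everything.
        if proxy.rdf_type is not None and p not in known[proxy.rdf_type]:
            continue
        # An object that is itself a fetched subject becomes a pointer to its proxy.
        proxy.values[p].append(proxies.get(o, o))
    return proxies


# Hypothetical example data.
EX = "http://example.org/"
triples = [
    (EX + "alice", RDF_TYPE, EX + "Person"),
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "knows", EX + "carol"),
    (EX + "alice", EX + "shoeSize", "38"),
    (EX + "bob", EX + "name", "Bob"),
]
known = {EX + "Person": {EX + "knows"}}
proxies = materialize(triples, known)
```
</pre>
<p>In this example the shoeSize triple is dropped because the client declared no field for it, the first knows value on alice points at the proxy for bob, and the untyped bob keeps every predicate as a name-value pair.</p>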
<p>In this way, .NET&#39;s considerable UI capabilities could directly be exploited for visualizing RDF data, provided that the data complies reasonably well with a known ontology.</p>
<p>If a <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x16b86e90">SPARQL</a> query returned a result-set, IRI-typed columns would be returned as .NET instances and the server would pre-fetch enough data to fill them in. For a CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an <a href="http://dbpedia.org/resource/Entity" id="link-id0x19a26180">Entity</a> <a href="http://dbpedia.org/resource/SQL" id="link-id0x1d8ea998">SQL</a> string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL-type queries, though.</p>
<p>Many of these questions will be better answerable as we get more details on Microsoft&#39;s forthcoming <a href="http://dbpedia.org/resource/ADO.NET" id="link-id0xde74a60">ADO.NET</a> release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.</p>]]></content:encoded>
 </rss:item>
</rdf:RDF>