The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF.
Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts.
The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008.
We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.
Warehousing all the world's publishable data as RDF is not our first preference, nor would it be the publisher's. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we'll do here.
First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.
But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. And when there is a clear mapping, there is no reason the SPARQL could not be directly translated.
If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!
We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in "query mixes per hour". (An update or follow-on to this post will provide elapsed times for each test run.)
With the unmodified benchmark we got:
In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)
The following were measured on the second run of a 100 query mix series, single test driver, warm cache.
We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time.
The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.
The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough.
StopCompilerWhenXOverRunTime = 1
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.
Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.
Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.
Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let's use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like "customers who bought this also bought xxx."
For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC.
select proc (??, ??)
Add transactions.Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server.
The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.
We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.
About this entry:
Author: Orri Erling
Published: 08/06/2008 19:35 GMT
08/06/2008 16:29 GMT
Comment Status: 0 Comments