BSBM With Triples and Mapped Relational Data

The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF.

Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts.

The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008.

But why an OLTP workload with RDF to begin with?

We believe this is relevant because RDF promises to be the interoperability factor among potentially all traditional information systems. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.

Warehousing all the world's publishable data as RDF is not our first preference, nor would it be the publisher's. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we'll do here.

What We Got

First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.

But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. And when there is a clear mapping, there is no reason the SPARQL could not be directly translated.

If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!

We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in "query mixes per hour". (An update or follow-on to this post will provide elapsed times for each test run.)

With the unmodified benchmark we got:

Physical Triples: 1297 qmph

Mapped Triples: 3144 qmph

In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)
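As a rough illustration of the Q6 change, the sketch below builds the two query shapes as strings: a regex filter over labels, and a variant using Virtuoso's `bif:contains` text predicate. The query text is a simplified assumption, not the exact BSBM Q6 nor the exact text we ran, and the helper names are hypothetical.

```python
# Simplified sketch of the two Q6 shapes; not the exact benchmark query text.
RDFS = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"


def regex_q6(words):
    """Portable form: a regex filter, evaluated against every label."""
    pattern = "|".join(words)
    return (RDFS +
            "SELECT ?product ?label WHERE {\n"
            "  ?product rdfs:label ?label .\n"
            f'  FILTER regex(?label, "{pattern}")\n'
            "}")


def text_index_q6(words):
    """Virtuoso-specific form: bif:contains is answered from the text index."""
    expr = " OR ".join(f"'{w}'" for w in words)
    return (RDFS +
            "SELECT ?product ?label WHERE {\n"
            "  ?product rdfs:label ?label .\n"
            f'  ?label bif:contains "{expr}" .\n'
            "}")


if __name__ == "__main__":
    print(text_index_q6(["alpha", "beta", "gamma"]))
```

The second form only works on a store that offers a text predicate, which is why it counts as a variant of the query rather than standard SPARQL.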

The following were measured on the second run of a 100 query mix series, single test driver, warm cache.

Physical Triples: 5746 qmph

Mapped Triples: 7525 qmph

We then ran the same with 4 concurrent instances of the test driver. The qmph here is the total of 400 query mixes (4 drivers x 100 mixes) divided by the longest of the four run times, expressed in hours.

Physical Triples: 19459 qmph

Mapped Triples: 24531 qmph

The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are about 3.3 to 3.4 times the single-driver throughput; falling short of 4 times is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.
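For readers who want to check the arithmetic, here is a minimal sketch of how the multi-driver metric and the scaling ratios can be computed. The function names are ours for illustration and are not part of the BSBM test driver; the qmph figures are the ones reported above.

```python
def qmph(query_mixes, elapsed_seconds):
    """Query mixes per hour for a run of the given length."""
    return query_mixes * 3600.0 / elapsed_seconds


def multi_driver_qmph(mixes_per_driver, run_times_seconds):
    """Total mixes over all drivers, divided by the longest run time."""
    total = mixes_per_driver * len(run_times_seconds)
    return qmph(total, max(run_times_seconds))


# Reported figures (qmph) from the text-index runs above.
single = {"physical": 5746, "mapped": 7525}
four_drivers = {"physical": 19459, "mapped": 24531}

for store in single:
    scaling = four_drivers[store] / single[store]
    print(f"{store}: {scaling:.2f}x with 4 drivers")
```

Using the longest run time rather than the average is the conservative choice: the metric only counts work as done when the slowest driver has finished.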

The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough.
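To put a rough bound on "accordingly higher": if compilation really is about 1/3 of server-side CPU, and if the server is CPU-bound and parameterization removed that cost entirely (both assumptions, not measurements), the throughput gain is capped as sketched below.

```python
from fractions import Fraction


def max_speedup(removed_fraction):
    """Upper bound on throughput gain if a fraction of CPU work disappears."""
    return 1 / (1 - removed_fraction)


compile_share = Fraction(1, 3)  # "about 1/3 of total server side CPU"
print(max_speedup(compile_share))  # prints 3/2
```

In practice the gain would be smaller, since a parameterized query still pays for parameter binding and plan lookup.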

We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.

Suggestions for BSBM

* Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.
* Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.
* Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let's use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like "customers who bought this also bought xxx."
* For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC.
* Add transactions. Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server.
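To make the parameterized query suggestion concrete, here is a toy model of what a parameter marker buys: the template text is compiled once and then executed many times with different values. The `PlanCache` class and its methods are hypothetical, written only for this illustration; they are not an existing SPARQL protocol feature or a Virtuoso API.

```python
class PlanCache:
    """Toy model: compile a query template once, execute it many times."""

    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def prepare(self, template):
        # Compile only on first sight of this template text.
        if template not in self.plans:
            self.compilations += 1
            self.plans[template] = {"text": template,
                                    "n_params": template.count("??")}
        return self.plans[template]

    def execute(self, template, params):
        plan = self.prepare(template)
        if len(params) != plan["n_params"]:
            raise ValueError("wrong number of parameters")
        # A real server would run the plan; here we only pair it with values.
        return plan["text"], tuple(params)


cache = PlanCache()
for product_id in (1, 2, 3):
    cache.execute("select proc (??, ??)", [product_id, "en"])
print(cache.compilations)  # prints 1
```

With literal values embedded in the query text, every request looks like a new query and is compiled again, which is the overhead the benchmark currently measures.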

Comments on Query Mix

The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.
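The scaling argument can be seen in a small sketch. A regex filter has to inspect every label, so its cost grows with the number of products, while a text index does one lookup per search word. The code below is a simplified stand-in, not how Virtuoso implements its text index, and the sample labels are made up.

```python
import re
from collections import defaultdict


def regex_scan(labels, words):
    """Check every label: work grows linearly with the number of products."""
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return {pid for pid, label in labels.items() if pattern.search(label)}


def build_text_index(labels):
    """Map each word to the set of products whose label contains it."""
    index = defaultdict(set)
    for pid, label in labels.items():
        for word in label.split():
            index[word].add(pid)
    return index


def index_lookup(index, words):
    """One lookup per search word, independent of the total label count."""
    hits = set()
    for w in words:
        hits |= index.get(w, set())
    return hits


labels = {1: "red widget", 2: "blue gadget", 3: "green widget deluxe"}
index = build_text_index(labels)
assert regex_scan(labels, ["widget", "blue"]) == index_lookup(index, ["widget", "blue"])
```

Note that the regex matches substrings while this index matches whole words, so the two agree on this sample but are not equivalent in general.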

We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.

Comments

Re:BSBM With Triples and Mapped Relational Data

Hi Orri and Ivan,

> Consequently, we need to show that mapping can outperform an RDF
> warehouse, which is what we'll do here.

Yes. I had been guessing for a while that SPARQL against RDF-mapped
relational DBs should be faster than SPARQL against triple stores.
With D2R Server it turned out that some queries are much faster, but
also that D2R Server performs really badly on others (especially Q5).
The bad performance with some queries was no surprise as there is
still lots of room for improvement in D2R Server's SPARQL-to-SQL query
rewriting algorithm.
Another observation was that the distance between native RDF stores
and RDF-mapped RDBs increases with dataset size.
So it looks as if, when you have more than 50M triples and a schema
that somehow fits into an RDB, you should go for the RDB solution.

> We also see that the advantage of mapping can be further increased
> by more compiler optimizations, so we expect in the end mapping will
> lead RDF warehousing by a factor of 4 or so.

Being able to show a factor of 4 on all dataset sizes would be very
interesting!

> Suggestions for BSBM
>
> * Reporting Rules. The benchmark spec should specify a form for
> disclosure of test run data, TPC style. This includes things like
> configuration parameters and exact text of queries. There should
> be accepted variants of query text, as with the TPC.

We have started collecting stuff that should go into the
full-disclosure report in section 6.2 of the benchmark spec
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html#reporting
but did not have the time to define a proper format for this yet (I
guess we will have some XML format). We will define the format for
version 2 of the benchmark, which will be released together with
updated results in about 3-4 weeks.

If you think that there is something missing from this list, please
let us know.

> * Multiuser operation. The test driver should get a stream number as
> parameter, so that each client makes a different query sequence.
> Also, disk performance in this type of benchmark can only be
> reasonably assessed with a naturally parallel multiuser workload.

Yes. This is already on our todo list and will also be part of the
next release.

> * Add business intelligence. SPARQL has aggregates now, at least
> with Jena and Virtuoso, so let's use these. The BSBM business
> intelligence metric should be a separate metric off the same data.
> Adding synthetic sales figures would make more interesting queries
> possible. For example, producing recommendations like "customers
> who bought this also bought xxx."

Hmm, yes and no. I would love to extend the benchmark with a BI query
mix, but aggregates are not yet an official part of SPARQL. Our goal
with the benchmark was to define a tool to compare stores that
implement the current SPARQL specs but not to fix these specs. Thus,
we stayed within the boundaries of the current spec and of course ran into
all the known problems of SPARQL (no aggregates, no free-text search,
no proper negation). All these things were discussed at the SPARQL 2
BOF at WWW2008 and I hope that they are all on Ivan Herman's list for
the charter of a new SPARQL WG.

> * For the SPARQL community, BSBM sends the message that one ought to
> support parameterized queries and stored procedures. This would be
> a SPARQL protocol extension; the SPARUL syntax should also have a
> way of calling a procedure. Something like select proc (??, ??)
> would be enough, where ?? is a parameter marker, like ? in
> ODBC/JDBC.

Also a great idea and maybe something Ivan does not have on his list
yet.

> * Add transactions. Especially if we are contrasting mapping vs.
> storing triples, having an update flow is relevant. In practice,
> this could be done by having the test driver send web service
> requests for order entry and the SUT could implement these as
> updates to the triples or a mapped relational store. This could
> use stored procedures or logic in an app server.

In principle yes, but we also wanted to design a benchmark that some
current RDF stores are able to run.
If I look at the current data load times of the SUTs I'm not so sure
that they like update streams ;-)

But I agree that update streams are clearly something that we should
have in the future.

> Comments on Query Mix
>
> The time of most queries is less than linear to the scale factor. Q6
> is an exception if it is not implemented using a text index. Without
> the text index, Q6 will inevitably come to dominate query time as
> the scale is increased, and thus will make the benchmark less
> relevant at larger scales.

You are right, and it is again a problem of us trying to stay within the
boundaries of the SPARQL spec.
No sane person would use a regex for this kind of free-text search,
but SPARQL only offers the regex function and nothing else.

Maybe we should be a bit less strict here and allow proprietary
variants of Q6 until SPARQL gets fixed.

> Next
>
> We include the sources of our RDF view definitions and other
> material for running BSBM with our forthcoming Virtuoso Open Source
> 5.0.8 release. This also includes all the query optimization work
> done for BSBM. This will be available in the coming days.

Great. We are looking forward to rerunning the benchmark with the new
Virtuoso release on our box. Especially being able to confirm the
factor 4 advantage of RDF-mapped RDBs over RDF stores would be fun
;-)

Cheers

Chris and Andreas

Posted by Chris Bizer on 08/07/2008 06:21 GMT

Comments URL for this entry: http://www.openlinksw.com/mt-tb/Http/comments?id=1409

Orri Erling's Weblog
