Chris Bizer

chris@bizer.de — Thu, 07 Aug 2008 10:21:20 GMT

Hi Orri and Ivan,

> Consequently, we need to show that mapping can outperform an RDF
> warehouse, which is what we'll do here.

Yes. I had suspected for a while that SPARQL against RDF-mapped
relational DBs should be faster than SPARQL against triple stores.
With D2R Server it turned out that some queries are much faster, but
also that D2R Server performs really badly on others (especially Q5).
The bad performance on some queries was no surprise, as there is
still lots of room for improvement in D2R Server's SPARQL-to-SQL query
rewriting algorithm.
Another observation was that the distance between native RDF stores
and RDF-mapped RDBs increases with dataset size.
So it looks as if, once you have more than 50M triples and schemata
that somehow fit into an RDB, you should go for the RDF solution.

> We also see that the advantage of mapping can be further increased
> by more compiler optimizations, so we expect in the end mapping will
> lead RDF warehousing by a factor of 4 or so.

Being able to show a factor of 4 on all dataset sizes would be very
interesting!

> Suggestions for BSBM
>
> * Reporting Rules. The benchmark spec should specify a form for
> disclosure of test run data, TPC style. This includes things like
> configuration parameters and exact text of queries. There should
> be accepted variants of query text, as with the TPC.

We have started collecting stuff that should go into the
full-disclosure report in section 6.2 of the benchmark spec
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html#reporting
but have not had the time to define a proper format for this yet (I
guess we will have some XML format). We will define the format for
version 2 of the benchmark, which will be released together with
updated results in about 3-4 weeks.

If you think that there is something missing from this list, please
let us know.

> * Multiuser operation. The test driver should get a stream number as
> parameter, so that each client makes a different query sequence.
> Also, disk performance in this type of benchmark can only be
> reasonably assessed with a naturally parallel multiuser workload.

Yes. This is already on our todo list and will also be part of the
next release.
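
To make the stream-number idea concrete, here is a minimal sketch (not
the actual BSBM test driver) of how a driver could derive a
reproducible, per-client query sequence from a stream number. The query
names, num_products and base_seed are assumptions made up for
illustration only.

```python
import random

# Hypothetical query mix; the real BSBM mix and its parameters are
# defined in the benchmark spec, not here.
QUERY_MIX = ["Q%d" % i for i in range(1, 13)]


def query_sequence(stream_number, num_products=1000, base_seed=20080807):
    """Return (query, product_id) pairs for one client stream.

    The RNG is seeded from the stream number, so each stream draws its
    own parameters while any single stream stays reproducible across runs.
    """
    rng = random.Random(base_seed * 1_000_003 + stream_number)
    return [(q, rng.randint(1, num_products)) for q in QUERY_MIX]


if __name__ == "__main__":
    # Two clients run the same mix with different parameters.
    for stream in (1, 2):
        print(stream, query_sequence(stream)[:3])
```

Seeding from the stream number, rather than from the clock, keeps runs
repeatable, which matters if results are to be disclosed and rerun by
others.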

> * Add business intelligence. SPARQL has aggregates now, at least
> with Jena and Virtuoso, so let's use these. The BSBM business
> intelligence metric should be a separate metric off the same data.
> Adding synthetic sales figures would make more interesting queries
> possible. For example, producing recommendations like "customers
> who bought this also bought xxx."

Hmm, yes and no. I would love to extend the benchmark with a BI query
mix, but aggregates are not yet an official part of SPARQL. Our goal
with the benchmark was to define a tool to compare stores that
implement the current SPARQL specs but not to fix these specs. Thus,
we stayed within the boundaries of the current spec and of course ran into
all the known problems of SPARQL (no aggregates, no free-text search,
no proper negation). All these things were discussed at the SPARQL 2
BOF at WWW2008 and I hope that they are all on Ivan Herman's list for
the charter of a new SPARQL WG.

> * For the SPARQL community, BSBM sends the message that one ought to
> support parameterized queries and stored procedures. This would be
> a SPARQL protocol extension; the SPARUL syntax should also have a
> way of calling a procedure. Something like select proc (??, ??)
> would be enough, where ?? is a parameter marker, like ? in
> ODBC/JDBC.

Also a great idea and maybe something Ivan does not have on his list
yet.
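
As a rough illustration of the parameter-marker idea, the sketch below
binds values to ?? markers on the client side. This is plain string
substitution and not the protocol extension proposed above; the function
names are hypothetical, only integers and strings are handled, and a ??
occurring inside a string literal of the template would be misread as a
marker.

```python
def to_literal(value):
    """Render a Python int or str as a SPARQL-style literal."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        return str(value)
    # Escape backslashes first, then quotes and newlines.
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def bind_parameters(template, params):
    """Replace each ?? marker in template with the next parameter."""
    parts = template.split("??")
    if len(parts) - 1 != len(params):
        raise ValueError("number of markers and parameters differ")
    out = [parts[0]]
    for value, rest in zip(params, parts[1:]):
        out.append(to_literal(value))
        out.append(rest)
    return "".join(out)


if __name__ == "__main__":
    print(bind_parameters("select proc (??, ??)", [42, "ProductType1"]))
```

A server-side implementation would instead parse the query once and bind
values at execution time, which is where the performance benefit of
parameterized queries comes from.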

> * Add transactions. Especially if we are contrasting mapping vs.
> storing triples, having an update flow is relevant. In practice,
> this could be done by having the test driver send web service
> requests for order entry and the SUT could implement these as
> updates to the triples or a mapped relational store. This could
> use stored procedures or logic in an app server.

In principle yes, but we also wanted to design a benchmark that some
current RDF stores are able to run.
If I look at the current data load times of the SUTs I'm not so sure
that they like update streams ;-)

But I agree that update streams are clearly something that we should
have in the future.

> Comments on Query Mix
>
> The time of most queries is less than linear to the scale factor. Q6
> is an exception if it is not implemented using a text index. Without
> the text index, Q6 will inevitably come to dominate query time as
> the scale is increased, and thus will make the benchmark less
> relevant at larger scales.

You are right, and it is again a consequence of us trying to stay within
the boundaries of the SPARQL spec.
No sane person would use a regex for this kind of free-text search,
but SPARQL only offers the regex function and nothing else.

Maybe we should be a bit less strict here and allow proprietary
variants of Q6 until SPARQL gets fixed.
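
To illustrate why the regex variant scales so badly, here is a small
sketch contrasting a linear regex scan with a lookup in a simple
inverted index. It is not how any particular store implements Q6; the
labels and function names are invented, and the index does whole-word
matching, which is not exactly equivalent to a substring regex.

```python
import re
from collections import defaultdict


def regex_search(labels, word):
    """Linear scan: every label is tested against the pattern, so the
    cost grows with the number of labels."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return sorted(pid for pid, label in labels.items() if pattern.search(label))


def build_text_index(labels):
    """Build a token -> set of product ids mapping once, up front."""
    index = defaultdict(set)
    for pid, label in labels.items():
        for token in re.findall(r"\w+", label.lower()):
            index[token].add(pid)
    return index


def index_search(index, word):
    """Single dictionary lookup, independent of the number of labels."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    labels = {1: "fast widget", 2: "slow gadget", 3: "Widget deluxe"}
    index = build_text_index(labels)
    print(regex_search(labels, "widget"))
    print(index_search(index, "widget"))
```

The scan touches every label on every query, while the index pays its
cost once at load time, which is the reason the regex form of Q6 comes
to dominate the mix as the dataset grows.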

> Next
>
> We include the sources of our RDF view definitions and other
> material for running BSBM with our forthcoming Virtuoso Open Source
> 5.0.8 release. This also includes all the query optimization work
> done for BSBM. This will be available in the coming days.

Great. We are looking forward to rerunning the benchmark with the new
Virtuoso release on our box. Being able to confirm the factor-4
advantage of RDF-mapped RDBs over RDF stores would be especially fun
;-)

Cheers

Chris and Andreas

BSBM With Triples and Mapped Relational Data
