This post introduces a series on RDF benchmarking. In these posts I will cover the following:

  • Correct misleading information about us in the recent Berlin report: the load rate is off the wall, and the update mix is missing. We supply the correct numbers and explain how to load the data so that one gets decent performance (a minimal loading sketch follows this list).

  • Discuss configuration options for Virtuoso.

  • Tell a story about multithreading and its perils, and how vectoring and scale-out can save us.

  • Analyze the run-time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.

  • Look at the benefits of SSDs (solid-state drives) over HDDs (hard disk drives; spinning platters), and at I/O matters in general.

  • Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Should we do things à la TPC or à la TREC? We will hopefully try a bit of both; at least that is what I have proposed to our partners in LOD2, the EU FP7 project that also funded the recent Berlin report.

  • Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload: the Social Intelligence Benchmark.

  • Talk about BSBM specifically: what does it measure?

  • Discuss some experiments with the BI use case of BSBM.

  • Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure.
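
To give a concrete taste of the loading topic above, here is a minimal sketch of driving the Virtuoso bulk loader from a script. It is only a sketch, not the procedure behind the results discussed later: the isql client and the ld_dir / rdf_loader_run bulk-loader functions are standard Virtuoso facilities, but the data directory, graph IRI, credentials, and the sizing hints in the comments are placeholders to be adjusted for the machine at hand.

    import subprocess

    # Minimal bulk-load sketch. Assumes a local Virtuoso instance with the
    # standard isql client, the default port 1111, and the default dba account;
    # the data directory and graph IRI below are placeholders.
    #
    # Memory-related settings live in virtuoso.ini under [Parameters],
    # notably NumberOfBuffers and MaxDirtyBuffers; sizing them to the
    # machine's RAM is part of what the configuration post will cover.

    ISQL = "isql"                      # OpenLink isql command-line client
    DATA_DIR = "/data/bsbm"            # placeholder: directory with the .ttl files
    GRAPH = "http://example.org/bsbm"  # placeholder: target graph IRI

    commands = f"""
    ld_dir ('{DATA_DIR}', '*.ttl', '{GRAPH}');
    rdf_loader_run ();
    checkpoint;
    """

    # Feed the commands to isql; several rdf_loader_run() sessions can be
    # started in parallel to use more cores during the load.
    subprocess.run([ISQL, "1111", "dba", "dba"], input=commands,
                   text=True, check=True)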

The background is that the LOD2 FP7 project is supposed to deliver a state-of-the-art report and a benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMSs.

Since this is taxpayer money spent for, supposedly, the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of software for benchmarking, when at least in some cases licenses prohibit unauthorized publishing of benchmark results, might be seen to conflict with the spirit of the license, if not its letter. We will see.

For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor's permission. Access to the installations is limited to whoever operates the equipment. Configuration files, detailed hardware specs, and the like, on the other hand, will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.

In the LOD2 proposal we also say, in so many words, that we will stretch the limits of the state of the art. This stretching is surely not limited to the project's own products but should also cover the benchmarking activity itself. I will say with confidence that running single-server benchmarks at a maximum of 200 Mtriples of data is not stretching anything.

So, to ameliorate this situation, I thought we would run the same benchmarks at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single-server scales. Then we could go, for example, to Giovanni's cluster at DERI and do 10 and 20 billion triples; this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS, who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one.

So I asked everybody for configuration instructions, which produced a certain amount of dismay, as I might be said to be biased and to be skirting the edges of a conflict of interest. The inquiry was not altogether negative, though, since Ontotext and Garlik provided some information. We will look into these this week and next. We will not publish any information without asking first.

In this series of posts I will only talk about OpenLink Software.

Benchmarks, Redux Series