In the interest of participating in a community benchmark development process, I will here outline some desiderata and explain how we could improve on LUBM. I will also touch on the message such an effort ought to convey.
A blow-by-blow analysis of the performance of a complex system such as a DBMS is more than fits within the scope of human attention at one go. This is why this all must be abbreviated into a single metric. Only when thus abbreviated, can this information be used in context. The metric's practical value is relative to how well it predicts the performance of the system in some real task. This means a task not likely to be addressed by an alternative technology, unless the challenger clearly beats the incumbent.
A benchmark is promotional material, both well as for the technology being benchmarked as a whole. This is why the benchmark, whatever it does, should do something that the technology does well, surely better than any alternative technology. A case in point is that one ought not to take a pure relational workload and RDF-ize it, for then the relational variant is likely to come out on top.
In this regard LUBM is not so bad because its reliance on class and property hierarchies and the occasional transitivity or inference rule makes the workload typically RDF, a little ways apart from a purely relational implementation of the task.
RDF's claim to fame is linked data. This means giving things globally unique names and thereby making anything joinable with anything else, insofar there is agreement on the names. RDF is a key to a new class of problems, call it web scale database. Web scale here refers first to heterogeneity and multiplicity of independent sources and secondly to volume of data.
Now there are plenty of relational applications with very large volumes of data. On the non-relational side, there are even larger applications, such as web search engines. All these have a set schema and a specific workload they are meant to address. RDF versions of such are conceivable but hold no intrinsic advantage if considered in the specific niche alone.
The claim to fame of RDF is not to outperform these on their home turf but to open another turf altogether, allowing agile joining and composing of all these resources.
This is why a benchmark, i.e., an an advertisement for the RDF value proposition, should not just take a relational workload and RDF-ize it. The benchmark should carry some of the web in it.
If we just intend to measure how well an RDF store joins triples to other triples, LUBM is almost good enough. If it defined a query mix with different frequencies for short and long queries and a concurrent query metric, it would be pretty much there. Our adaptation of it is adequate for counting joins per second. But joins per second is not a value proposition.
So we have two questions:
If we just take the RDF model and SPARQL, how do we make a benchmark that fills in what LUBM does not cover?
The answers to the first are not very complex:
Add some optionals. Have different frequencies of occurrence for some properties.
Add different graphs. Make queries joining between graphs and drawing on different graphs. Querying against all graphs of the store is not a part of the language. Still this would be useful but leave it out for now.
Add some filters and arithmetic. Not much can be done there, though because expressions cannot be returned and there is no aggregation or grouping.
Split the workload into short and long queries. The short should be typical for online use and the long ones for analysis. Different execution frequencies for different queries is a must. Analysis is limited by lack of grouping, expressions or aggregation. Still, something can be contrived by looking for a pattern that does not exist or occurs extremely rarely. Producing result sets of millions of rows is not realistic.
Many of the LUBM queries return thousands of rows, even when scoped to a single university. This is not very realistic. No user interface displays that sort of quantity. Of course, the intermediate results can be large as you please but the output must be somehow ranked. SPARQL has order by and limit, so these will have to be used. TPC H for example has almost always a group by/order by combination and sometimes a result rows limit.
The degree of inference in LUBM is about right, mostly sub-classes and sub-properties, nothing complex. We certainly regard this as a database benchmark more than a knowledge representation or rule system one.
LUBM does an OK job of defining a scale factor. I think that a concurrent query metric can just be so many queries per time at a given scale. The number of clients, I would say, can be decided by the test sponsor, taking whatever works best. A load balancer or web server can always be tuned to enforce some limit on concurrency. I don't think that a scale rule like in TPC C, where it says that only so many transactions per minute are allowed per warehouse is needed here. The effect of this is that when reporting a higher throughput, one has to automatically have a bigger database.
There is nothing to prevent these improvements from being put into a subsequent version of LUBM.
Building something that shows RDF at its best is a slightly different proposition. For this, we cannot be limited to the SPARQL recommendation and must allow custom application code and language extensions. Examples would be scripting similar to SQL stored procedures and extensions such as we have made for sub-queries and aggregation, explained a couple of posts back.
Maybe the Billion Triples challenge produces some material that we can use for this. We need to go for spaces that are not easily reached with SQL, have distributed computing, federation, discovery, demand driven import of data and such like.
I'll write more about ways of making RDF shine in some future post.
There are two kinds of workloads: online and offline. Online is what must be performed in an interactive situation, without significant human perceptible delay, i.e. within 500 ms. Anything else is offline.
Because this is how any online system is designed, this should be reflected in the benchmark. Ideally we would make two benchmarks.
About this entry:
Author: Orri Erling
Published: 02/05/2008 11:47 GMT
03/25/2008 14:43 GMT
Comment Status: 0 Comments