Let us talk about what ought to be benchmarked in the context of RDF.

A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else's problem; besides, it is a solved one.

On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.

BSBM seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB to RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is, however, somewhat more difficult and also better thought out than the BSBM BI Mix proposal. But I do not expect an RDF audience to have any enthusiasm for TPC-H, as it is by now a very tough race, and one in which RDB and SQL will keep some advantage. Using it as a validation test is nonetheless meaningful, as there exist a validation dataset and queries that we have already RDF-ized. We could publish these and call the result "RDF-H".

In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI.

The Social Network Intelligence Benchmark (SNIB) takes as its starting point the Facebook-style social web schema that Ivan Mikhailov and I made last year under the name of Botnet BM. CWI is presently working on this within LOD2.

The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. DBpedia is not very large, around 200M-300M triples, but it is diverse enough.

The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.
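As a rough illustration of how a generator could produce such correlations, here is a minimal sketch that picks friends with a probability weighted by shared locale and shared sport. The locale and sport lists, the boost factors, and all function names are assumptions made up for this example; they are not part of any SNIB specification.

```python
import random

# Hypothetical value domains; a real generator would draw these from DBpedia.
LOCALES = ["Amsterdam", "Boston", "Cairo", "Delhi"]
SPORTS = ["football", "tennis", "cycling"]

def make_users(n, rng):
    """Create n simulated users with a random locale and favorite sport."""
    return [
        {"id": i, "locale": rng.choice(LOCALES), "sport": rng.choice(SPORTS)}
        for i in range(n)
    ]

def affinity(a, b, locale_boost=10.0, sport_boost=5.0):
    """Relative likelihood that a knows b; the boosts are illustrative only."""
    weight = 1.0
    if a["locale"] == b["locale"]:
        weight *= locale_boost
    if a["sport"] == b["sport"]:
        weight *= sport_boost
    return weight

def pick_friends(user, users, k, rng):
    """Draw up to k distinct friends, favoring similar users."""
    others = [u for u in users if u["id"] != user["id"]]
    weights = [affinity(user, u) for u in others]
    # random.choices samples with replacement, so de-duplicate by id.
    chosen = rng.choices(others, weights=weights, k=k)
    return list({u["id"]: u for u in chosen}.values())
```

With four equally likely locales, an uncorrelated generator would give about a quarter of same-locale friendships; the weighting above pushes that share well above half, which is the kind of skew a query optimizer would then have to cope with.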

The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from DBpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.

There is variation in frequency of online interaction, and this interaction consists of sessions. For example, one could analyze user behavior per time of day for online ad placement.

The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation.

There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.

Ragged data, with half-filled profiles and misspelled identifiers such as person and place names, is a natural part of the social web use case. The data generator should take this into account.
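To make the duplicate-identity and misspelling points concrete, here is a minimal sketch of candidate-pair detection over ragged profiles. It uses the standard library's difflib for string similarity; the account fields, the 0.85 threshold, and the rule that a missing locale does not rule out a match are all assumptions chosen for illustration, not a proposed SNIB task definition.

```python
from difflib import SequenceMatcher
from itertools import combinations

def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def duplicate_candidates(accounts, threshold=0.85):
    """Return id pairs whose names are similar and whose locales do not conflict."""
    pairs = []
    for x, y in combinations(accounts, 2):
        lx, ly = x.get("locale"), y.get("locale")
        # A half-filled profile (missing locale) must not exclude a match.
        if lx and ly and lx != ly:
            continue
        if name_similarity(x["name"], y["name"]) >= threshold:
            pairs.append((x["id"], y["id"]))
    return pairs

# Hypothetical sample data: a1 and a2 differ by one misspelled letter.
accounts = [
    {"id": "a1", "name": "Jon Smith", "locale": "Boston"},
    {"id": "a2", "name": "John Smith", "locale": None},
    {"id": "a3", "name": "Mary Jones", "locale": "Boston"},
]
```

Note that the pairwise comparison is quadratic in the number of accounts, which is exactly why this kind of task belongs with the heavier workloads rather than the online mix.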

  • Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.

  • The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.
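The power-law-like activity pattern from the first point can be sketched as follows. This is only an illustration under assumed parameters: it uses a Zipf-style weight of 1/rank^exponent, whereas the text suggests sampling actual popularity measures from existing social networks; the function names are hypothetical.

```python
import random
from collections import Counter

def zipf_weights(n, exponent=1.0):
    """Weight of the user at popularity rank r is 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n + 1)]

def assign_activity(n_users, n_events, rng, exponent=1.0):
    """Assign each event (e.g., a message) to an author; returns events per user."""
    weights = zipf_weights(n_users, exponent)
    authors = rng.choices(range(n_users), weights=weights, k=n_events)
    return Counter(authors)
```

Scaling the dataset then means raising n_users and n_events together while keeping the exponent fixed, so that the shape of the distribution stays the same across scales.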

Queries scale very differently, however: some are logarithmic in data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads.

The first would be an online mix with typically short lookups and updates, around O(log n).

The Business Intelligence Mix would be composed of queries around O(n log n). Even so, with real data, the choice of parameters will produce dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of "hard" and "easy" parameter choices. In the BSBM BI mix modification, I did this by defining some queries to be drill-downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values.
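The "buckets of values" idea can be sketched as follows: parameter values are ordered by an estimated cost, split into buckets, and every run draws the same number of values from each bucket. The use of per-value cardinality as the cost estimate, the bucket count, and the function names are assumptions for this example; the post does not specify how the buckets would be defined.

```python
import random

def bucket_parameters(cardinality_by_value, n_buckets=3):
    """Split parameter values into near-equal buckets, ordered easy to hard."""
    ordered = sorted(cardinality_by_value, key=cardinality_by_value.get)
    size, extra = divmod(len(ordered), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        end = start + size + (1 if i < extra else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets

def draw_run_parameters(buckets, per_bucket, rng):
    """Pick per_bucket values from each bucket, then shuffle the run order.

    Requires per_bucket <= len(bucket) for every bucket.
    """
    run = []
    for bucket in buckets:
        run.extend(rng.sample(bucket, per_bucket))
    rng.shuffle(run)
    return run
```

Two runs with different seeds then use different parameter values but the same mix of easy and hard ones, which keeps their run-times comparable.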

Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.

One characteristic of these operations is the production of sometimes large intermediate results that need to be materialized. Doing these operations in practice requires procedural control. Further, running algorithms like network analytics (e.g., PageRank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language. Some basic graph operations like shortest path are expressible in principle, but not in unextended SPARQL 1.1, since they would, for example, involve returning paths, which are explicitly excluded from the spec.
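For comparison, here is what shortest path with the path itself returned looks like as procedural code. This is a plain breadth-first search over an in-memory adjacency map, offered as a minimal sketch; the graph representation and the function name are assumptions, and a database implementation would of course work against stored triples rather than a Python dict.

```python
from collections import deque

def shortest_path(knows, source, target):
    """Return one shortest directed path from source to target, or None.

    knows maps a person to the list of people they link to.
    """
    if source == target:
        return [source]
    parent = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in knows.get(node, ()):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == target:
                # Walk the parent links back to the source.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None
```

The parent map is the materialized intermediate result here; on a large social graph it can grow to a sizable fraction of the nodes before the target is found.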

These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.

We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads.
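To show the shape such a job would take, here is a single-process sketch of the map, shuffle, and reduce phases applied to counting messages per DBpedia tag. It is a stand-in written for illustration only: it does not use any map-reduce framework, it is unrelated to the LarKC code, and the message layout and function names are assumptions.

```python
from collections import defaultdict

def map_message(message):
    """Map phase: emit (tag, 1) for every tag on a message."""
    for tag in message["tags"]:
        yield tag, 1

def shuffle(pairs):
    """Shuffle phase: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)

def tag_counts(messages):
    """Run the three phases in sequence and return {tag: message count}."""
    pairs = (pair for m in messages for pair in map_message(m))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

In a cluster setting the map and reduce calls would run in parallel on partitions of the data, while in a parallel database the same computation is simply a GROUP BY, which is the comparison the benchmark would make visible.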

We see a three-level structure emerging. There is an Online mix which is a bit like the BSBM Explore mix, and an Analytics mix which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice.

We would finally get to the "raging conflict" between the "declarativists" and the "map reductionists." Last year's VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map-reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.

We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs.

We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.

There is not, nor ought there to be, a sheltered, RDF-only enclave. RDF will have to justify itself in a world of alternatives.

This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does. RDF cannot be a total failure at this, even if this were not RDF's claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.

Benchmarks, Redux Series