This post presents some ideas and use cases for RDF store benchmarking.

Use Cases

  • Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.
  • Recursive rule application. The simpler cases of this are things like transitive closure.
  • Mapping of relational data to RDF. Since relational benchmarks are well established, as in the TPC benchmarks, the schemas and test data generation can come from there. The problem is that the TPC-D, TPC-H, and TPC-R benchmarks consist exclusively of aggregates and grouping, which SPARQL does not have.
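To make the recursive-rule use case concrete, here is a minimal sketch of transitive closure over a set of triples, computed by naive fixpoint iteration. The function name, the `subClassOf` predicate, and the class names are assumptions chosen for illustration; a real store would evaluate such rules very differently.

```python
def transitive_closure(triples, predicate):
    """Return every (s, predicate, o) triple implied by treating `predicate` as transitive."""
    # Keep only the edges that use the predicate of interest.
    edges = {(s, o) for s, p, o in triples if p == predicate}
    closure = set(edges)
    while True:
        # Extend each known path by one base edge; stop when nothing new appears.
        new = {(a, d) for a, b in closure for c, d in edges if b == c} - closure
        if not new:
            break
        closure |= new
    return {(s, predicate, o) for s, o in closure}


# Hypothetical class hierarchy, loosely in the spirit of LUBM.
triples = [
    ("GraduateStudent", "subClassOf", "Student"),
    ("Student", "subClassOf", "Person"),
    ("Person", "subClassOf", "Agent"),
]
inferred = transitive_closure(triples, "subClassOf")
```

A benchmark could vary the depth and fan-out of such hierarchies to see how the cost of rule application grows with scale.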

Benchmarking Triple Stores

An RDF benchmark suite should meet the following criteria:

  • Have a single scale factor.
  • Produce a single metric, queries per unit of time, for example. The metric should be concisely expressible, for example 10 qpsR at 100M, options 1, 2, 3. Due to the heterogeneous nature of the systems under test, the result's short form likely needs to specify the metric, scale and options included in the test.
  • Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.
  • Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.
  • Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. The typical degree of connectedness, number of triples per person, etc. can be measured from real files.
  • Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run and continuous query load.

The query load should illustrate the following types of operations:

  • Basic lookups, such as would be made for filling in a person's home page in a social network app. For example, list the user's data plus the names and emails of friends. These involve relatively short joins, unions, and optionals.
  • Graph operations like shortest path from individual to individual in a social network.
  • Selecting data with drill down, as in faceted browsing. For example, start with articles having tag t, see distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2 and so forth.
  • Retrieving all closely related nodes, as in composing a SIOC snapshot over a person's posts in different communities, the recent activity report for a forum, etc. These will be CONSTRUCT or DESCRIBE queries. The coverage of DESCRIBE is unclear, hence CONSTRUCT may be better.
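The faceted drill-down in the list above can be sketched in a few lines. This is an in-memory illustration only: the `drill_down` function, the article ids, and the tags are all hypothetical, and a real benchmark would issue the equivalent SPARQL queries against the store.

```python
def drill_down(article_tags, selected):
    """Return (ids of articles carrying all selected tags, other distinct tags on those articles)."""
    selected = set(selected)
    # Articles whose tag set contains every selected tag.
    matching = {a for a, tags in article_tags.items() if selected <= tags}
    # Distinct tags of the matching articles, minus the ones already chosen.
    remaining = set().union(*(article_tags[a] for a in matching)) - selected
    return matching, remaining


# Hypothetical tagged articles.
article_tags = {
    "a1": {"rdf", "sparql", "benchmark"},
    "a2": {"rdf", "sioc"},
    "a3": {"rdf", "sparql", "sioc"},
    "a4": {"sql"},
}

step1 = drill_down(article_tags, ["rdf"])            # start with tag t
step2 = drill_down(article_tags, ["rdf", "sparql"])  # narrow with tag t2
```

Each step narrows the article set and recomputes the candidate tags, which is the pattern the query load would need to exercise.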

If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to TPC-C. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.
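As a rough illustration of the metric described above, the sketch below divides the main query count by the running time and derives how many secondary executions the mix would require. The query names and the per-10 ratios are made up for the example; the actual mix would have to be agreed upon in the test spec.

```python
def queries_per_second(main_executions, elapsed_seconds):
    """The headline metric: main query executions divided by running time."""
    return main_executions / elapsed_seconds


def required_secondary_counts(main_executions, mix_per_10_main):
    """Secondary executions implied by a mix given as 'runs per 10 main queries'."""
    return {name: main_executions * per_10 // 10
            for name, per_10 in mix_per_10_main.items()}


# Hypothetical mix: runs of each secondary query per 10 runs of the main lookup.
mix = {"shortest_path": 1, "faceted_drill_down": 3, "construct_snapshot": 2}

qps = queries_per_second(36_000, 3_600)              # a one hour measured interval
secondary = required_secondary_counts(36_000, mix)
```

Only the main query counts toward the metric; the secondary queries are load that must be carried in the stated proportion during the same interval.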

Full Disclosure Report

The report contains basic TPC-like items such as:

  • Metric qps/scale/options
  • Software used, DBMS, RDF toolkit if separate
  • Hardware. Number, clock and type of CPUs per machine, number of machines in cluster, RAM per machine, disks per machine, manufacturer, price of hardware/software

These can go into a summary spreadsheet that is just like the TPC ones.

Additionally, the full report should include:

  • Configuration files for DBMS, web server, other components.
  • Parameters for test driver, i.e., number of clicks, how many concurrent clicks. The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as a function of concurrent clients is a lot of work and maybe not necessary here.
  • Duration in real time. Since for any large database with a few GB of working set the warm-up time is easily 30 minutes, the warm-up time should be mentioned but not included in the metric. The measured interval should not be less than 1 hour in duration and should reflect a "steady state," as defined in the TPC rules.
  • Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages or any other server side software-like thing that exists or is modified for the purpose of the test.
  • Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.
  • Database sizes. For a preallocated database of n GB, how much was free after the initial load, how much after the test run? How many bytes per triple?
  • CPU/IO. This may not always be readily measurable but is still interesting. Maybe a realistic spec is listing the sum of CPU minutes across all server machines and server processes. For IO, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them.
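The bytes-per-triple figure in the list above is simple arithmetic. The sketch below assumes that all space consumed after the initial load is attributable to the triples and their indexes, and the sizes used are invented for the example.

```python
GB = 1024 ** 3  # assumption: sizes are reported in binary gigabytes


def bytes_per_triple(allocated_bytes, free_bytes_after_load, triple_count):
    """Space consumed by the initial load divided by the number of triples loaded."""
    return (allocated_bytes - free_bytes_after_load) / triple_count


# Hypothetical run: 32 GB preallocated, 20 GB still free after loading 100M triples.
bpt = bytes_per_triple(32 * GB, 20 * GB, 100_000_000)
```

Repeating the same calculation with the free space after the test run shows how much the triples added during the run cost.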

Test Drivers

OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads, for example 10000 users with 100 threads. Each user has its own state, so the users carry out their respective usage patterns independently and are served as soon as the server is available, while no more than m requests are in flight at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart, etc. This can be modified to browse a social network database and produce the desired query mix. The program generates HTTP requests, so it would work against a SPARQL endpoint or any set of dynamic web pages.

The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.
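The general shape of such a driver can be sketched as follows. This is not the OpenLink program, only an illustration of multiplexing n users over m threads under stated assumptions: `run_driver`, `fake_request`, and the operation names are hypothetical, and the stub sleeps instead of issuing an HTTP request.

```python
import queue
import threading
import time
from collections import defaultdict


def run_driver(n_users, m_threads, pattern, send_request):
    """Walk n_users through `pattern` with at most m_threads requests in flight."""
    ready = queue.Queue()
    for user_id in range(n_users):
        ready.put((user_id, 0))  # per-user state: (user id, index of next step)
    timings = defaultdict(list)
    lock = threading.Lock()

    def worker():
        while True:
            try:
                user_id, step = ready.get_nowait()
            except queue.Empty:
                return  # no user is waiting; parallelism may taper off near the end
            op = pattern[step]
            start = time.perf_counter()
            send_request(user_id, op)
            elapsed = time.perf_counter() - start
            with lock:
                timings[op].append(elapsed)
            if step + 1 < len(pattern):
                ready.put((user_id, step + 1))  # user becomes ready for its next click

    threads = [threading.Thread(target=worker) for _ in range(m_threads)]
    began = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - began
    clicks = sum(len(v) for v in timings.values())
    # Per operation: (count, min, avg, max) in seconds.
    stats = {op: (len(v), min(v), sum(v) / len(v), max(v)) for op, v in timings.items()}
    return clicks / wall, stats


def fake_request(user_id, op):
    """Stand-in for an HTTP request to the system under test."""
    time.sleep(0.001)


rate, stats = run_driver(20, 4, ["check_mail", "browse_catalogue", "add_to_cart"], fake_request)
```

The returned rate corresponds to the clicks per second figure, and `stats` to the end-of-run min/avg/max listing per operation.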

This can be packaged as a separate open source download once the test spec is agreed upon.

For generating test data, a modification of the LUBM generator is probably the most convenient choice.

Benchmarking Relational to RDF Mapping

This area is somewhat more complex than triple storage.

At least the following factors enter into the evaluation: 

  • Degree of SPARQL compliance. For example, can one have a variable as predicate? Are there limits on optionals and unions?
  • Are the data being queried split over multiple RDBMS and joined between them?
  • Type of use case. Is this about navigational lookups or about statistics? OLTP or OLAP? It would be the former, as SPARQL does not really have aggregation. Still, many of the interesting queries are about comparing large data sets.

The rationale for mapping relational data to RDF is often data integration. Even in simple cases like the OpenLink Data Spaces applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.

A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.

A real world case is OpenLink's ongoing work for mapping WordPress, MediaWiki, phpBB, Drupal, and possibly other popular web applications into SIOC.

Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.

Another "enterprise style" scenario might be to take the TPC-C and TPC-D databases (after all, both have products, customers and orders) and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.

Considering the times and the audience, the WordPress/MediaWiki scenario might be culturally more interesting and more fun to demo.

The test has two aspects: throughput and coverage. I think these should be measured separately.

The throughput can be measured with queries that are generally sensible, such as "get articles by an author that I know with tags t1 and t2."

Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known only at run time, or if the graph is not given, we get a union of everything joined with another union of everything. Many of the joins between the terms of the different unions are identically empty, but the software may not know this.

In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like "list all predicates and objects of everything called gizmo where the predicate is in the product ontology".
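A toy model shows why such queries blow up and what a clever mapping might prune. Assume a two-pattern query where neither predicate is known, so each pattern expands into a union over every mapped predicate. The predicate and class names below are invented, and the pruning rule (a join survives only if the object class of the first predicate equals the subject class of the second) is a simplification for illustration.

```python
from itertools import product

# Hypothetical mappings: predicate -> (subject class, object class or None for literals).
MAPPINGS = {
    "wp:author": ("wp:Post", "wp:User"),
    "wp:title": ("wp:Post", None),
    "wp:name": ("wp:User", None),
    "bb:poster": ("bb:Message", "bb:Member"),
    "bb:nick": ("bb:Member", None),
}


def join_candidates(mappings):
    """Return (all union-branch pairs, pairs whose join can be non-empty)."""
    all_pairs = list(product(mappings, repeat=2))
    feasible = [
        (p1, p2)
        for p1, p2 in all_pairs
        # The object of the first pattern must be a resource of the class
        # that the second predicate's subjects belong to.
        if mappings[p1][1] is not None and mappings[p1][1] == mappings[p2][0]
    ]
    return all_pairs, feasible


all_pairs, feasible = join_candidates(MAPPINGS)
```

With five mapped predicates the naive expansion has 25 join branches, of which only two can return rows. A mapping layer that knows the classes can drop the rest before anything reaches the RDBMS.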

It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.