DBpedia Benchmark Revisited

We ran the DBpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries.

Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box.

We ran the queries in two variants:

With graph specified in the SPARQL FROM clause, using the default indices.
With no graph specified anywhere, using an alternate indexing scheme.

The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times.

First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient.

So we added the index:

create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);

	Virtuoso v5 with gspo, ogps, pogs	Virtuoso Cluster v6 with gspo, ogps	Virtuoso Cluster v6 with gspo, ogps, pogs
cold	210 s	136 s	33.4 s
warm	0.600 s	4.01 s	0.628 s

OK, so now let us do it without a graph being specified. For all platforms, we drop any existing indices, and --

create table r2 (g iri_id_8, s, iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)) alter index R2 on R2 partition (s int (0hexffff00)); log_enable (2); insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; drop table rdf_quad; alter table r2 rename RDF_QUAD; create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));

The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.

With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia.

If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.

	Virtuoso v5 with gspo, ogps, pogs	Virtuoso Cluster v6 with spog, pogs, opgs, gpos
warm	0.595 s	0.617 s

The cold times were about the same as above, so not reproduced.

Graph or No Graph?

It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.

On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant different setups of the database, simple as that.

The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above.

Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.

Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.

Conclusions

The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.

Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content.

This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.

The morale of the story is to have a lot of RAM and space-efficient data representation.

The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic.

Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.

Appendix: Query Text

For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM <http://dbpedia.org> from each query. The returned row counts are indicated below each query's text.

sparql SELECT ?p ?o FROM <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };

-- 1337 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM <http://dbpedia.org> WHERE {
  ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director FROM <http://dbpedia.org> 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage FROM <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
   <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?berlinLat + 0.03190235436 &&
     ?long       >=     ?berlinLong - 0.08679199218 &&
     ?lat        >=     ?berlinLat - 0.03190235436 && 
     ?long       <=     ?berlinLong + 0.08679199218) };

-- 56 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage FROM <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
   <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?nyLat + 0.3190235436 &&
     ?long       >=     ?nyLong - 0.8679199218 &&
     ?lat        >=     ?nyLat - 0.3190235436 && 
     ?long       <=     ?nyLong + 0.8679199218) };

-- 13 rows

Orri Erling's Weblog

Details

Subscribe

Tag Cloud

Post Categories