We ran the DBpedia benchmark queries again with different
configurations of Virtuoso. I had not studied the details of the
matter previously but now did have a closer look at the
queries.
Comparing numbers given by different parties is a constant
problem. In the case reported here, we loaded the full DBpedia 3,
all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6,
all on the same 4 core 2GHz Xeon with 8G RAM. All databases were
striped on 6 disks. The Cluster configuration was with 4 processes
in the same box.
We ran the queries in two variants:
- With graph
specified in the SPARQL
FROM
clause, using the default indices.
- With no graph specified anywhere, using an
alternate indexing scheme.
The times below are for the sequence of 5 queries; individual
query times are not reported. I did not do a line-by-line review of
the execution plans since they seem to run well enough. We could
get some extra mileage from cost model tweaks, especially for the
numeric range conditions, but we will do this when somebody comes up
with better times.
First, about Virtuoso v5: Because there is a query in the set that
specifies no condition on S or O and only P, this simply cannot be
done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is
more space efficient.
So we added the index:
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
|
Virtuoso v5 with gspo, ogps, pogs
|
Virtuoso Cluster v6 with gspo, ogps
|
Virtuoso Cluster v6 with gspo, ogps, pogs
|
cold
|
210 s |
136 s |
33.4 s |
warm
|
0.600 s |
4.01 s |
0.628 s |
OK, so now let us do it without a graph being specified. For
all platforms, we drop any existing indices, and --
create table r2 (g iri_id_8, s, iri_id_8, p iri_id_8, o any, primary key (s, p, o, g))
alter index R2 on R2 partition (s int (0hexffff00));
log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;
drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
The code is identical for v5 and v6, except that with v5 we use
iri_id (32 bit)
for the type, not iri_id_8 (64 bit)
. We note that
we run out of IDs with v5 around a few billion triples, so with v6
we have double the ID length and still manage to be vastly more
space efficient.
With the above 4 indices, we can query the data pretty much in
any combination without hitting a full scan of any index. We note
that all indices that do not begin with s end with s as a bitmap.
This takes about 60% of the space of a non-bitmap index for data such
as DBpedia.
If you intend to do completely arbitrary RDF queries in
Virtuoso, then chances are you are best off with the above index
scheme.
|
Virtuoso v5 with gspo, ogps, pogs
|
Virtuoso Cluster v6 with spog, pogs, opgs, gpos
|
warm
|
0.595 s |
0.617 s |
The cold times were about the same as above, so not
reproduced.
Graph or No Graph?
It is in the SPARQL spirit to specify a graph and for pretty
much any application, there are entirely sensible ways of keeping
the data in graphs and specifying which ones are concerned by
queries. This is why Virtuoso is set up for this by default.
On the other hand, for the open web scenario, dealing with an
unknown large number of graphs, enumerating graphs is not possible
and questions like which graph of which source asserts x become
relevant. We have two distinct use cases which warrant different
setups of the database, simple as that.
The latter use case is not really within the SPARQL spec, so
implementations may or may not support this. For example Oracle or
Vertica would not do this well since they partition data according
to graph or predicate, respectively. On the other hand, stores that
work with one quad table, which is most of the ones out there,
should do it maybe with some configuring, as shown above.
Frameworks like Jena are not to my knowledge geared towards
having a wildcard for graph, although I would suppose this can be
arranged by adding some "super-graph" object, a graph of all
graphs. I don't think this is directly supported and besides most
apps would not need it.
Once the indices are right, there is no difference between
specifying a graph and not specifying a graph with the queries considered. With
more complex queries, specifying a graph or set of graphs does
allow some optimizations that cannot be done with no graph specified.
For example, bitmap intersections are possible only when all
leading key parts are given.
Conclusions
The best warm cache time is with v5; the five queries run under
600 ms after the first go. This is noted to show that all-in-memory with
a single thread of execution is hard to beat.
Cluster v6 performs the same queries in 623 ms. What is gained in
parallelism is lost in latency if all operations complete in
microseconds. On the other hand, Cluster v6 leaves v5 in the dust in
any situation that has less than 100% hit rate. This is due to
actual benefit from parallelism if operations take longer than a
few microseconds, such as in the case of disk reads. Cluster v6 has
substantially better data layout on disk, as well as fewer pages to
load for the same content.
This makes it possible to run the queries without the pogs
index on Cluster v6 even when v5 takes prohibitively long.
The morale of the story is to have a lot of RAM and space-efficient data representation.
The DBpedia benchmark does not specify any random access
pattern that would give a measure of sustained throughput under
load, so we are left with the extremes of cold and warm cache of
which neither is quite realistic.
Chris Bizer and I have talked on and off about benchmarks and
I have made suggestions that we will see incorporated into the
Berlin SPARQL benchmark, which will, I believe, be much more
informative.
Appendix: Query Text
For reference, the query texts specifying the graph are below. To
run without specifying the graph, just drop the FROM
<http://dbpedia.org>
from each query. The returned row counts are indicated
below each query's text.
sparql SELECT ?p ?o FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };
-- 1337 rows
sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM <http://dbpedia.org> WHERE {
?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
?film1 p:starring ?actor1 .
?film2 p:starring ?actor1 .
?film2 p:starring ?actor2 . };
-- 23910 rows
sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director FROM <http://dbpedia.org>
WHERE {
?artwork p:artist ?artist .
?artwork p:museum ?museum .
?museum p:director ?director };
-- 303 rows
sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
<http://dbpedia.org/resource/Berlin> geo:long ?berlinLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s foaf:homepage ?homepage .
FILTER (
?lat <= ?berlinLat + 0.03190235436 &&
?long >= ?berlinLong - 0.08679199218 &&
?lat >= ?berlinLat - 0.03190235436 &&
?long <= ?berlinLong + 0.08679199218) };
-- 56 rows
sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage FROM <http://dbpedia.org> WHERE {
<http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
<http://dbpedia.org/resource/New_York_City> geo:long ?nyLong .
?s geo:lat ?lat .
?s geo:long ?long .
?s p:architect ?a .
?a foaf:homepage ?homepage .
FILTER (
?lat <= ?nyLat + 0.3190235436 &&
?long >= ?nyLong - 0.8679199218 &&
?lat >= ?nyLat - 0.3190235436 &&
?long <= ?nyLong + 0.8679199218) };
-- 13 rows