(Posted verbatim from Orri Erling's
Blog.)
More RDF scalability tests:
We have lately been busy with RDF scalability. We work with the
8000 university LUBM data set, a little over a billion triples. We
can load it in 23h 46m on a box with 8G RAM. With 16G we probably
could get it in 16h.
The resulting database is 75G, or 74 bytes per triple, which is not
bad. It will shrink a little more if explicitly compacted by
merging adjacent partly filled pages. See Advances in Virtuoso
Triple Storage for an in-depth treatment of the subject.
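As a quick sanity check on the figures above, the arithmetic can be sketched as follows. The post gives only the database size (75G), the bytes per triple (74) and the load time (23h 46m), so the triple count is derived from those rather than taken from the data set, and whether "G" means 10^9 or 2^30 bytes is an assumption left as a parameter. The function names are made up for this illustration.

```python
# Figures quoted in the post; everything else is derived from them.
DB_SIZE_G = 75                      # database size, in "G"
BYTES_PER_TRIPLE = 74               # storage cost per triple
LOAD_SECONDS = 23 * 3600 + 46 * 60  # 23h 46m load time


def implied_triples(size_g, bytes_per_triple, unit=10**9):
    """Triple count implied by total size and per-triple cost.

    `unit` is an assumption: 10**9 for decimal gigabytes,
    2**30 for binary gigabytes. The post does not say which.
    """
    return size_g * unit / bytes_per_triple


def load_rate(triples, seconds):
    """Average load throughput in triples per second."""
    return triples / seconds


triples_decimal = implied_triples(DB_SIZE_G, BYTES_PER_TRIPLE)
triples_binary = implied_triples(DB_SIZE_G, BYTES_PER_TRIPLE, unit=2**30)
rate = load_rate(triples_decimal, LOAD_SECONDS)
```

Under either reading of "G" the implied count comes out a little over a billion triples, which matches the description of the data set, and the average load rate is on the order of 12,000 triples per second.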
The real question of RDF scalability is finding a way of having
more than one CPU on the same index tree without them hitting the
prohibitive penalty of waiting for a mutex. The sure solution is
partitioning, which would probably have to be by range of the whole key.
But before we go to so much trouble, we'll look at dropping a couple
of critical sections from index random access. Also some kernel
parameters may be adjustable, like a spin count before calling the
scheduler when trying to get an occupied mutex. Still we should not
waste too much time on platform specifics. We'll see.
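To make the partitioning idea concrete, here is a minimal sketch of an index split by range of the whole key, with one mutex per partition so that threads working in different key ranges do not wait on each other. This is a toy illustration only, not how Virtuoso is implemented: the class name, the use of (subject, predicate, object) tuples of integers as keys, and the fixed boundary list are all assumptions made for the example.

```python
import bisect
import threading


class RangePartitionedIndex:
    """Toy index split by key range, with one lock per partition."""

    def __init__(self, boundaries):
        # Sorted split points. Partition 0 holds keys below boundaries[0],
        # partition i holds keys in [boundaries[i-1], boundaries[i]),
        # and the last partition holds keys >= boundaries[-1].
        self.boundaries = sorted(boundaries)
        count = len(self.boundaries) + 1
        self.partitions = [set() for _ in range(count)]
        self.locks = [threading.Lock() for _ in range(count)]

    def partition_of(self, key):
        # Binary search on the whole key, compared as a tuple.
        return bisect.bisect_right(self.boundaries, key)

    def insert(self, key):
        i = self.partition_of(key)
        with self.locks[i]:  # only this key range is locked
            self.partitions[i].add(key)

    def contains(self, key):
        i = self.partition_of(key)
        with self.locks[i]:
            return key in self.partitions[i]


# Hypothetical split points on a (subject, predicate, object) key.
index = RangePartitionedIndex([(100, 0, 0), (200, 0, 0)])
index.insert((150, 2, 3))
```

The trade-off the post hints at is visible even here: the split points have to be chosen (and eventually rebalanced) so that the load is spread evenly, which is the "so much trouble" that makes trimming critical sections worth trying first.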
We just updated the Virtuoso Open Source cut. The latest RDF
refinements are not in, so maybe the cut will have to be refreshed
shortly.
We are also now applying the relational to RDF mapping discussed
in "Declarative SQL Schema to RDF Ontology Mapping" to the ODS
applications.
There is a form of the mapping in the VOS cut on the net but it
is not quite ready yet. We must first finish testing it through
mapping all the relational schemas of the ODS apps before we can
really recommend it. This is another reason for a VOS update in the
near future.
We will be looking at the query side of LUBM after the ISWC 2006
conference. So far, we find that queries compile OK for many SIOC use
cases with the current cost model. A more systematic
review of the cost model for SPARQL will come when we get to the
queries.
We put some ideas about inferencing in the Advances in Triple
Storage paper. The question is whether we should forward chain such
things as class subsumption and subproperties. If we build these
into the SQL engine used for running SPARQL, we probably can do
these as unions at run time with good performance and better
working set due to not storing trivial entailed triples. Some more
thought and experimentation need to go into this.
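The difference between forward chaining and expanding at run time can be sketched in a few lines. In the sketch below, class subsumption is not materialized: a lookup for instances of a class is answered as a union over that class and all of its subclasses, so only the explicitly asserted type triples are stored. This is an illustration of the general technique under simple assumptions (in-memory tuples, hypothetical `ex:` class names), not the SQL engine change discussed in the post.

```python
def subclasses_of(cls, subclass_pairs):
    """Return cls together with all of its transitive subclasses.

    subclass_pairs is an iterable of (child, parent) pairs.
    """
    pairs = list(subclass_pairs)
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, declared_parent in pairs:
            if declared_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def instances_of(cls, triples, subclass_pairs):
    """Answer 'instances of cls' as a union over cls and its subclasses.

    No entailed rdf:type triples are stored; the expansion happens
    at query time.
    """
    classes = subclasses_of(cls, subclass_pairs)
    return {s for (s, p, o) in triples if p == "rdf:type" and o in classes}


# Hypothetical class hierarchy and data, for illustration only.
hierarchy = [
    ("ex:FullProfessor", "ex:Professor"),
    ("ex:Professor", "ex:Faculty"),
    ("ex:Lecturer", "ex:Faculty"),
]
data = [
    ("ex:alice", "rdf:type", "ex:FullProfessor"),
    ("ex:bob", "rdf:type", "ex:Lecturer"),
    ("ex:carol", "rdf:type", "ex:Student"),
]
faculty = instances_of("ex:Faculty", data, hierarchy)
```

A forward-chaining store would instead add triples such as `ex:alice rdf:type ex:Professor` and `ex:alice rdf:type ex:Faculty` at load time, which makes the lookup a single index probe but enlarges the working set with the trivially entailed triples the post mentions.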