We have lately been busy with RDF scalability. We work with the 8000 university LUBM data set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.

The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See Advances in Virtuoso RDF Triple Storage for an in-depth treatment of the subject.

The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, would probably have to be by range of the whole key. but before we go to so much trouble, we'll look at dropping a couple of critical sections from index random access. Also some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still we should not waste too much time on platform specifics. We'll see.

We just updated the Virtuoso Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.

We are also now applying the relational to RDF mapping discussed in Declarative SQL Schema to RDF Ontology Mapping to the ODS applications.

There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.

We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the cost model that there is now. A more systematic review of the cost model for SPARQL will come when we get to the queries.

We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the SQL engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.