(Posted verbatim from Orri Erling's Weblog): RDF Bulk Loading Revisited:
We have run new benchmarks loading the 47 million triples
of the Wikipedia links data set. So far, our best result is 40
minutes on a dual-core Xeon with 8G of memory. This comes to about
18000 triples per second, with between 1.2 and 2 CPU cores busy
depending slightly on configuration parameters. Our previous best
result was 7700 triples per second on a dual 1.6GHz SPARC, loading
the 2M triple Wordnet data set.
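For reference, 40 minutes is 2400 seconds, so 47 million triples in
that time is on the order of 18000-20000 triples per second,
depending on the exact triple count and wall-clock time behind the
rounded figures.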
These are memory-based speeds. We have implemented automatic
background compaction for database tables and have tried the
Wikipedia load both with and without it. The CPU cost of the
compaction was about 10%, with a slight gain in real time due to
less IO.
But the real bottleneck remains IO. With the compaction on, we got
91 bytes per triple, all included, i.e. the two indices on the
triples table, the dictionaries from IRI IDs to URIs, etc. The
compaction is rather simple: it detects adjacent dirty pages about
to be written to disk and checks whether the set of contiguous dirty
pages would fit on fewer pages than they now occupy. If so, it
rewrites the pages and frees the ones left over. It does not touch
clean pages. With some more logic it could also compact clean pages,
provided the result did not end up with more dirty pages than the
initial situation.
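To make the shape of that logic concrete, here is a minimal
standalone sketch in C. The page structure, page size and buffer
layout are invented for the example; this is an illustration of the
idea, not Virtuoso's actual code.

/* Minimal sketch of run detection for dirty-page compaction.
 * All structures here are hypothetical stand-ins. */
#include <stdio.h>

#define PAGE_SIZE 8192

typedef struct page_s
{
  int pg_no;        /* position of the page in the database file */
  int pg_bytes;     /* bytes actually occupied on the page */
  int pg_is_dirty;  /* nonzero if the page is about to be flushed */
} page_t;

/* Walk pages sorted by pg_no, find runs of contiguous dirty pages
 * and report the runs whose content would fit on fewer pages if
 * rewritten back to back.  Clean pages are never touched. */
static void
compact_dirty_runs (const page_t * pages, int n_pages)
{
  int i = 0;
  while (i < n_pages)
    {
      if (!pages[i].pg_is_dirty)
        {
          i++;
          continue;
        }
      int run_start = i;
      int run_bytes = pages[i].pg_bytes;
      while (i + 1 < n_pages && pages[i + 1].pg_is_dirty
             && pages[i + 1].pg_no == pages[i].pg_no + 1)
        {
          i++;
          run_bytes += pages[i].pg_bytes;
        }
      int run_len = i - run_start + 1;
      /* Pages needed if the run's content were packed tightly. */
      int needed = (run_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
      if (run_len > 1 && needed < run_len)
        printf ("run at page %d: %d dirty pages fit on %d, freeing %d\n",
                pages[run_start].pg_no, run_len, needed, run_len - needed);
      i++;
    }
}

int
main (void)
{
  /* Three adjacent dirty pages, each only about a third full,
   * followed by a clean page that is left alone. */
  page_t pages[] = {
    { 10, 2600, 1 }, { 11, 2900, 1 }, { 12, 2500, 1 }, { 13, 8000, 0 }
  };
  compact_dirty_runs (pages, 4);
  return 0;
}

The real compaction of course also has to rewrite the rows and free
the leftover pages; the point of the sketch is only the run
detection and the fits-on-fewer-pages test.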
With more aggressive compaction we will get about 75 bytes per
triple. We will try this.
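For a sense of the absolute sizes these per-triple figures imply,
simple arithmetic on the numbers above: 47,000,000 triples at 91
bytes per triple is roughly 4.3 GB all told, and at 75 bytes per
triple roughly 3.5 GB, both of which fit in the 8G of memory of the
test machine, consistent with the loads running at memory-based
speeds.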
But the real gains will come from index compression with
bitmaps. For the Wikipedia data set, this will cut one of the
indices to about a third of its current size. This is also the
index with the more random access pattern, so the benefit is
compounded in terms of working set. At that point we will be looking at about 50
bytes per triple. We will see next week how this works with the
LUBM RDF benchmark.
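To illustrate the bitmap idea, here is a minimal standalone sketch
in C, again invented for this post rather than taken from the
engine: when many entries of an index share the same leading key and
differ only in a trailing integer ID, a sorted run of those IDs can
be stored as a base value plus a bitmap of offsets instead of one
full value per entry. The chunk size and layout are arbitrary
choices for the example.

/* Minimal sketch of bitmap encoding for a run of index IDs that
 * share the same leading key.  Layout is hypothetical. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BITMAP_RANGE  1024                /* IDs coverable per chunk */
#define BITMAP_BYTES  (BITMAP_RANGE / 8)

typedef struct bm_chunk_s
{
  uint32_t bm_base;                 /* first ID covered by this chunk */
  uint8_t  bm_bits[BITMAP_BYTES];   /* bit n set => ID bm_base + n present */
} bm_chunk_t;

/* Encode a sorted run of IDs into bitmap chunks; returns the number
 * of chunks used, or -1 if out of space. */
static int
bm_encode (const uint32_t * ids, int n_ids, bm_chunk_t * out, int max_chunks)
{
  int n_chunks = 0;
  for (int i = 0; i < n_ids; i++)
    {
      if (n_chunks == 0
          || ids[i] >= out[n_chunks - 1].bm_base + BITMAP_RANGE)
        {
          if (n_chunks == max_chunks)
            return -1;
          memset (&out[n_chunks], 0, sizeof (bm_chunk_t));
          out[n_chunks].bm_base = ids[i];
          n_chunks++;
        }
      uint32_t off = ids[i] - out[n_chunks - 1].bm_base;
      out[n_chunks - 1].bm_bits[off / 8] |= (uint8_t) (1 << (off % 8));
    }
  return n_chunks;
}

int
main (void)
{
  /* 200 IDs clustered in a narrow range, as link targets often are. */
  uint32_t ids[200];
  for (int i = 0; i < 200; i++)
    ids[i] = 500000 + 3 * i;
  bm_chunk_t chunks[8];
  int n = bm_encode (ids, 200, chunks, 8);
  if (n < 0)
    return 1;
  printf ("plain: %zu bytes, bitmap: %zu bytes in %d chunk(s)\n",
          200 * sizeof (uint32_t), n * sizeof (bm_chunk_t), n);
  return 0;
}

In the example the 800 bytes of plain 32-bit IDs fit in a single
132-byte chunk; how much this buys in practice depends on how
densely the IDs cluster, and the roughly threefold reduction
mentioned above is what the Wikipedia data is expected to give.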