We have run new benchmarks loading the 47 million triples of the Wikipedia links data set. So far, our best result is 40 minutes on a dual-core Xeon with 8G of memory. This comes to about 18000 triples per second with between 1.2 and 2 CPU cores busy, depending slightly on configuration parameters. Our previous best result was 7700 triples per second on a dual 1.6GHz SPARC, loading the 2M-triple Wordnet data set.

These are memory-based speeds. We have implemented automatic background compaction for database tables and have tried the Wikipedia load with and without it. The CPU cost of the compaction was about 10%, with a slight gain in real time due to less IO.

But the real deal remains IO. With the compaction on, we got 91 bytes per triple, all included, i.e., two indices on the triples table, dictionaries from IRI IDs to URIs, etc. The compaction is rather simple: it detects adjacent dirty pages about to be written to disk and checks whether the set of contiguous dirty pages would fit on fewer pages than they currently occupy. If so, it rewrites the pages and frees the ones left over. It does not touch clean pages. With some more logic it could also compact clean pages, provided the result did not have more dirty pages than the initial situation. With more aggressive compaction we expect about 75 bytes per triple. We will try this.
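To make the idea concrete, here is a minimal sketch in C of the dirty-page check, assuming a hypothetical buffer representation (a page_t with a used-byte count and a dirty flag, and a fixed PAGE_SIZE). It only reports what each run of dirty pages could be repacked into; the real code would move the rows, free the surplus pages, and leave clean pages alone.

```c
/* A minimal sketch of the dirty-page compaction idea described above,
 * assuming a hypothetical buffer-pool representation.  This is not the
 * actual Virtuoso code. */
#include <stdio.h>

#define PAGE_SIZE 8192  /* hypothetical page payload size */

typedef struct page_s
{
  int used_bytes;  /* bytes of live rows on the page */
  int is_dirty;    /* about to be written to disk? */
} page_t;

/* For a run of contiguous dirty pages [start, start + n), return the
 * number of pages the live rows would need after repacking. */
static int
pages_needed (page_t *pages, int start, int n)
{
  long total = 0;
  for (int i = start; i < start + n; i++)
    total += pages[i].used_bytes;
  return (int) ((total + PAGE_SIZE - 1) / PAGE_SIZE);
}

/* Scan the buffer, find runs of adjacent dirty pages and report how
 * many pages each run could free.  Clean pages are never touched,
 * exactly as in the description above. */
static void
compact_dirty_runs (page_t *pages, int n_pages)
{
  int i = 0;
  while (i < n_pages)
    {
      if (!pages[i].is_dirty)
        { i++; continue; }
      int start = i;
      while (i < n_pages && pages[i].is_dirty)
        i++;
      int run = i - start;
      int needed = pages_needed (pages, start, run);
      if (needed < run)
        printf ("run at %d: %d dirty pages -> %d, %d freed\n",
                start, run, needed, run - needed);
    }
}

int
main (void)
{
  /* toy buffer: partly filled dirty pages interleaved with clean ones */
  page_t pages[8] = {
    { 3000, 1 }, { 3500, 1 }, { 2000, 1 }, { 8000, 0 },
    { 4000, 1 }, { 1000, 1 }, { 8192, 0 }, { 500, 1 }
  };
  compact_dirty_runs (pages, 8);
  return 0;
}
```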

But the real gains will come from index compression with bitmaps. For the Wikipedia data set, this will cut one of the indices to about a third of its current size. This is also the more randomly accessed of the indices, so the benefit is compounded in terms of working set. At that point we will be looking at about 50 bytes per triple. We will see next week how this works with the LUBM RDF benchmark.
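As a rough illustration of why bitmaps help, the sketch below (again hypothetical, not Virtuoso's actual on-disk layout) encodes the trailing IDs of index rows that share a leading key as a base value plus a bitmap over a range of IDs, and compares that with storing each ID in full.

```c
/* A minimal sketch of bitmap compression for the last key part of an
 * index, assuming rows that share a leading key and differ only in a
 * trailing 64-bit ID.  Not Virtuoso's actual format. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Encode sorted, distinct IDs falling in [base, base + 8 * n_bytes)
 * into a bitmap; returns the number of IDs encoded. */
static int
bitmap_encode (const uint64_t *ids, int n_ids, uint64_t base,
               uint8_t *bitmap, int n_bytes)
{
  memset (bitmap, 0, n_bytes);
  int encoded = 0;
  for (int i = 0; i < n_ids; i++)
    {
      uint64_t off = ids[i] - base;
      if (off >= (uint64_t) n_bytes * 8)
        break;                      /* would start a new bitmap chunk */
      bitmap[off / 8] |= (uint8_t) (1 << (off % 8));
      encoded++;
    }
  return encoded;
}

int
main (void)
{
  /* toy example: 12 object IDs under one (graph, subject, predicate) prefix */
  uint64_t ids[] = { 100, 101, 103, 104, 110, 111, 112,
                     120, 121, 122, 130, 131 };
  int n = sizeof ids / sizeof ids[0];
  uint8_t bitmap[8];                /* covers IDs 100 .. 163 */
  int encoded = bitmap_encode (ids, n, 100, bitmap, sizeof bitmap);

  printf ("explicit 8-byte IDs: %d bytes\n", n * 8);
  printf ("bitmap chunk (base + %zu-byte bitmap): %d IDs in %zu bytes\n",
          sizeof bitmap, encoded, sizeof (uint64_t) + sizeof bitmap);
  return 0;
}
```

In this toy case, 96 bytes of explicit keys shrink to a 16-byte chunk; real data is less dense than that, which is why the expected gain on the Wikipedia index is the more modest factor of about three mentioned above.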