We have run new benchmarks loading the 47 million triples of the Wikipedia links data set. So far, our best result is 40 minutes on a dual-core Xeon with 8G of memory. This comes to about 18000 triples per second with between 1.2 and 2 CPU cores busy, depending slightly on configuration parameters. Our previous best result was 7700 triples per second on a dual 1.6GHz SPARC, loading the 2M-triple Wordnet data set.

These are memory-based speeds. We have implemented automatic background compaction for database tables and have tried the Wikipedia load with and without it. The CPU cost of the compaction was about 10%, with a slight gain in real time due to less IO.

But the real deal remains IO. With the compaction on, we got 91 bytes per triple, all included, i.e., two indices on the triples table, dictionaries from IRI IDs to URIs, etc. The compaction is rather simple: it detects adjacent dirty pages about to be written to disk and checks whether the set of contiguous dirty pages would fit on fewer pages than they currently occupy. If so, it rewrites the pages and frees the ones left over. It does not touch clean pages. With some more logic it could also compact clean pages, provided the result did not have more dirty pages than the initial situation. With more aggressive compaction we expect about 75 bytes per triple. We will try this.
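To make the idea concrete, here is a minimal sketch in C of the dirty-page check, assuming a hypothetical buffer representation (a page_t with a used-byte count and a dirty flag, and a fixed PAGE_SIZE). It only reports what each run of dirty pages could be repacked into; the real code would move the rows, free the surplus pages, and leave clean pages alone.

```c
/* A minimal sketch of the dirty-page compaction idea described above,
 * assuming a hypothetical buffer-pool representation.  This is not the
 * actual Virtuoso code. */
#include <stdio.h>

#define PAGE_SIZE 8192  /* hypothetical page payload size */

typedef struct page_s
{
  int used_bytes;  /* bytes of live rows on the page */
  int is_dirty;    /* about to be written to disk? */
} page_t;

/* For a run of contiguous dirty pages [start, start + n), return the
 * number of pages the live rows would need after repacking. */
static int
pages_needed (page_t *pages, int start, int n)
{
  long total = 0;
  for (int i = start; i < start + n; i++)
    total += pages[i].used_bytes;
  return (int) ((total + PAGE_SIZE - 1) / PAGE_SIZE);
}

/* Scan the buffer, find runs of adjacent dirty pages and report how
 * many pages each run could free.  Clean pages are never touched,
 * exactly as in the description above. */
static void
compact_dirty_runs (page_t *pages, int n_pages)
{
  int i = 0;
  while (i < n_pages)
    {
      if (!pages[i].is_dirty)
        { i++; continue; }
      int start = i;
      while (i < n_pages && pages[i].is_dirty)
        i++;
      int run = i - start;
      int needed = pages_needed (pages, start, run);
      if (needed < run)
        printf ("run at %d: %d dirty pages -> %d, %d freed\n",
                start, run, needed, run - needed);
    }
}

int
main (void)
{
  /* toy buffer: partly filled dirty pages interleaved with clean ones */
  page_t pages[8] = {
    { 3000, 1 }, { 3500, 1 }, { 2000, 1 }, { 8000, 0 },
    { 4000, 1 }, { 1000, 1 }, { 8192, 0 }, { 500, 1 }
  };
  compact_dirty_runs (pages, 8);
  return 0;
}
```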

But the real gains will come from index compression with bitmaps. For the Wikipedia data set, this will cut one of the indices to about a third of its current size. This is also the more randomly accessed of the indices, so the benefit is compounded in terms of working set. At that point we will be looking at about 50 bytes per triple. We will see next week how this works with the LUBM RDF benchmark.
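As a rough illustration of why bitmaps help, the sketch below (again hypothetical, not Virtuoso's actual on-disk layout) encodes the trailing IDs of index rows that share a leading key as a base value plus a bitmap over a range of IDs, and compares that with storing each ID in full.

```c
/* A minimal sketch of bitmap compression for the last key part of an
 * index, assuming rows that share a leading key and differ only in a
 * trailing 64-bit ID.  Not Virtuoso's actual format. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Encode sorted, distinct IDs falling in [base, base + 8 * n_bytes)
 * into a bitmap; returns the number of IDs encoded. */
static int
bitmap_encode (const uint64_t *ids, int n_ids, uint64_t base,
               uint8_t *bitmap, int n_bytes)
{
  memset (bitmap, 0, n_bytes);
  int encoded = 0;
  for (int i = 0; i < n_ids; i++)
    {
      uint64_t off = ids[i] - base;
      if (off >= (uint64_t) n_bytes * 8)
        break;                      /* would start a new bitmap chunk */
      bitmap[off / 8] |= (uint8_t) (1 << (off % 8));
      encoded++;
    }
  return encoded;
}

int
main (void)
{
  /* toy example: 12 object IDs under one (graph, subject, predicate) prefix */
  uint64_t ids[] = { 100, 101, 103, 104, 110, 111, 112,
                     120, 121, 122, 130, 131 };
  int n = sizeof ids / sizeof ids[0];
  uint8_t bitmap[8];                /* covers IDs 100 .. 163 */
  int encoded = bitmap_encode (ids, n, 100, bitmap, sizeof bitmap);

  printf ("explicit 8-byte IDs: %d bytes\n", n * 8);
  printf ("bitmap chunk (base + %zu-byte bitmap): %d IDs in %zu bytes\n",
          sizeof bitmap, encoded, sizeof (uint64_t) + sizeof bitmap);
  return 0;
}
```

In this toy case, 96 bytes of explicit keys shrink to a 16-byte chunk; real data is less dense than that, which is why the expected gain on the Wikipedia index is the more modest factor of about three mentioned above.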