Orri shares his findings from internal experimentation with Virtuoso and bulk loading of RDF content such as the Wikipedia3 and Wordnet data sets:
Here is a dump of the post, titled "Intermediate RDF Loading Results":
Following from the post about a new
Multithreaded RDF Loader, here are some intermediate results
and action plans based on my findings.
The experiments were run on a dual 1.6GHz Sun SPARC with 4GB of RAM
and two SCSI disks. The data sets were the 48M-triple Wikipedia data
set and the 1.9M-triple Wordnet data set. 100% CPU means one CPU
constantly active. 100% disk means one thread blocked on the read
system call at all times.
Starting with an empty database, loading the Wikipedia set took
315 minutes, amounting to about 2500 triples per second. After
this, loading the Wordnet data set with cold cache and 48M triples
already in the table took 4 minutes 12 seconds, amounting to 6838
triples per second. Loading the Wikipedia data saw CPU usage peak at
180%, but over the whole run CPU usage was around 50%, with disk I/O
around 170%. Loading the larger data set was significantly I/O
bound, while loading the smaller set was more CPU bound, yet still
did not reach the full 200% CPU.
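As a quick back-of-the-envelope check of those rates (a minimal sketch; the triple counts used are the rounded figures quoted above, so the derived numbers are approximate):

    # Rough check of the reported Wikipedia load rate using the rounded count above.
    wikipedia_triples = 48_000_000                 # ~48M triples
    wikipedia_seconds = 315 * 60                   # 315 minutes
    print(wikipedia_triples / wikipedia_seconds)   # ~2540 triples/s, i.e. "about 2500"

    # The Wordnet figure follows the same arithmetic over the 4m12s (252 s) run.
    wordnet_seconds = 4 * 60 + 12
    print(6838 * wordnet_seconds)                  # the quoted rate implies roughly 1.72M triples inserted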
The RDF quad table was indexed on GSPO and PGOS. As one would
expect, the bulk of I/O was on the PGOS index. We note that the
pages of this index were on average only 60% full. Thus the
most relevant optimization seems to be to fill the pages closer to
90%. This will directly cut about a third of all I/O, and will also
bring a windfall benefit in the form of better disk cache hit
rates resulting from a smaller database.
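The "about a third" figure follows directly from the fill factors; a minimal sketch of the arithmetic, assuming uniform row sizes:

    # Pages needed for a fixed set of rows scale inversely with the page fill factor.
    current_fill = 0.60    # observed average fill of the PGOS index pages
    target_fill = 0.90     # fill aimed for after compaction

    ratio = current_fill / target_fill
    print(f"pages (and the I/O to read them) shrink to {ratio:.0%} of today's count")  # ~67%, i.e. about a third less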
The most practical way to keep index pages full under an
unpredictable, random insert order is to take sets of adjacent
index leaf pages and compact their rows so that the last page of the
set ends up empty. Since this is basically an I/O optimization, it
should be done when preparing to write the pages to disk, hence it
mostly concerns old dirty pages. Insert and update times will not
be affected since these operations will not concern themselves with
compaction. Thus the CPU cost of background compaction will be
negligible in comparison with writing the pages to disk. Naturally
this will benefit any relational application as well as free text
indexing. RDF and free text will be the largest beneficiaries due
to the large numbers of short rows inserted in random order.
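A minimal sketch of the kind of background leaf-page compaction described above; the page structure and function names are hypothetical illustrations, not Virtuoso internals:

    from dataclasses import dataclass, field

    @dataclass
    class LeafPage:
        capacity: int                 # max rows per page (simplified: uniform row size)
        rows: list = field(default_factory=list)

    def compact_leaf_run(pages, target_fill=0.90):
        """Repack rows from a run of adjacent dirty leaf pages so earlier pages
        are filled close to target_fill and trailing pages empty out entirely.
        Meant to run only when the pages are about to be flushed to disk, so
        foreground inserts and updates never pay the compaction cost."""
        rows = [r for p in pages for r in p.rows]          # preserves key order
        per_page = int(pages[0].capacity * target_fill)
        for p in pages:
            p.rows, rows = rows[:per_page], rows[per_page:]
        kept = [p for p in pages if p.rows]
        freed = [p for p in pages if not p.rows]           # can go back on the free list
        return kept, freed

    # Example: four 60%-full pages of 100-row capacity compact into three fuller pages.
    pages = [LeafPage(capacity=100, rows=list(range(i * 60, (i + 1) * 60))) for i in range(4)]
    kept, freed = compact_leaf_run(pages)
    print(len(kept), len(freed))    # 3 kept, 1 freed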
Looking at the CPU usage of the tests, locating the place in the
index at which to insert, which by rights should be the bulk of the
time cost, was not very significant, only about 15%. Thus there are
many untapped opportunities for optimization, for example rewriting
in C some parts of the loader currently done as stored procedures.
Also, the thread usage of the loader, with one thread parsing and
mapping IRI strings to IRI IDs and six threads sharing the inserting,
could be refined for better balance, as we have noted that the parser
thread sometimes forms a bottleneck. Moving the updating of the IRI
name to IRI id mapping onto the insert thread pool would produce some
benefit.
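To illustrate the rebalancing idea, here is a rough sketch of a loader pipeline in which the IRI-name-to-IRI-id mapping runs on the insert workers rather than on the parser thread; the names and queue-based structure are assumptions for illustration, not the actual Virtuoso loader:

    import queue, threading

    triples_q = queue.Queue(maxsize=10_000)
    iri_ids, iri_lock = {}, threading.Lock()

    def iri_to_id(iri):
        # IRI name -> IRI id mapping done on the insert workers, so the
        # parser thread only tokenizes and enqueues.
        with iri_lock:
            return iri_ids.setdefault(iri, len(iri_ids) + 1)

    def parser(source_lines):
        for line in source_lines:
            s, p, o = line.split()          # stand-in for real N-Triples parsing
            triples_q.put((s, p, o))
        for _ in range(N_INSERT_THREADS):   # one sentinel per worker
            triples_q.put(None)

    def insert_worker():
        while (t := triples_q.get()) is not None:
            s, p, o = (iri_to_id(x) for x in t)
            # insert the resulting quad into the RDF quad table here

    N_INSERT_THREADS = 6
    workers = [threading.Thread(target=insert_worker) for _ in range(N_INSERT_THREADS)]
    for w in workers: w.start()
    parser(["<a> <b> <c>", "<a> <b> <d>"])
    for w in workers: w.join()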
Anyway, since the most important test was I/O bound, we will
first implement some background index compaction and then revisit
the experiment. We expect to be able to double the throughput of
the Wikipedia data set loading.
About this entry:
Author: Kingsley Uyi Idehen
Published: 07/18/2006 15:21 GMT-0500
Modified: 07/18/2006 14:28 GMT-0500
Tags: rdf, semanticweb, virtuoso
Categories: Virtual Database, Semantic Web, Programming, Content Syndication, Database Technology
Comment Status: 0
Permalink:
http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/1013