We have been playing with the Wikipedia3 RDF data set, 48 million triples or so. We have long foreseen the need for a special bulk loader for RDF, and this data set made the need immediate.
So I wrote a generic parallel extension to Virtuoso/PL and SQL. This consists of a function for creating a queue that will feed async requests to be served on a thread pool of configurable size. Each of the worker threads has its own transaction, and the owner of the thread pool can look at or block for the return states of individual requests. This is a generic means for delegating work to async threads from Virtuoso/PL. Of course, this can also be used at a lower level for parallelizing single SQL queries, for example aggregating over a large table or creating an index on a large table. Many applications, such as the ODS Feed Manager, will also benefit, since this makes it more convenient to schedule parallel downloads from news sources and the like. This extension will make its way into the release after next.
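The queue-plus-thread-pool idea can be illustrated outside Virtuoso. The following is only a rough Python analogy using the standard `concurrent.futures` module, not the Virtuoso/PL interface itself; `load_batch` and `run_async` are hypothetical names, and the worker body merely stands in for work that would run in its own transaction.

```python
from concurrent.futures import ThreadPoolExecutor


def load_batch(batch_id, rows):
    # Hypothetical unit of work; in the real system each worker
    # would run this inside its own transaction.
    return batch_id, len(rows)


def run_async(batches, pool_size=4):
    # The pool plays the role of the async queue: requests are
    # submitted and served by up to pool_size worker threads.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(load_batch, i, rows)
                   for i, rows in enumerate(batches)]
        # The owner blocks for the return state of each request.
        return [f.result() for f in futures]


results = run_async([["a", "b"], ["c"], []], pool_size=2)
```

Here the owner waits on every request in submission order; it could just as well poll individual futures with `done()` and continue with other work.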
But back to RDF. We presently have the primary key of the triple store as GSPO and a second index as PGOS. Using this mechanism, we will experiment with different multithreaded loading configurations. One thread translates from the IRI text representation to the IRI IDs, one thread may insert into the GSPO index, which is typically local, and a few threads will share the inserting into the PGOS key. The latter key is inserted in random order, whereas the former is inserted mainly in ascending order when loading new data. In this way, we should be able to keep full load on several CPUs and even more disks.
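As a toy illustration of this loading configuration, here is a minimal Python sketch, assuming in-memory lists as stand-ins for the two indices and a dictionary as the IRI-to-ID map. All names (`bulk_load`, `insert_worker`, `STOP`) are hypothetical, and nothing here reflects how Virtuoso actually stores or locks its indices.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def insert_worker(q, index, lock, key_order):
    # Take translated rows off the queue and append them to the
    # index in the column order given by key_order.
    while True:
        row = q.get()
        if row is STOP:
            break
        key = tuple(row[i] for i in key_order)
        with lock:
            index.append(key)


def bulk_load(quads, pgos_threads=3):
    iri_ids = {}          # IRI text -> IRI ID, used by this thread only
    gspo, pgos = [], []   # stand-ins for the two indices
    gspo_q, pgos_q = queue.Queue(), queue.Queue()
    gspo_lock, pgos_lock = threading.Lock(), threading.Lock()

    # One thread for GSPO, several sharing the PGOS inserts.
    workers = [threading.Thread(
        target=insert_worker, args=(gspo_q, gspo, gspo_lock, (0, 1, 2, 3)))]
    workers += [threading.Thread(
        target=insert_worker, args=(pgos_q, pgos, pgos_lock, (2, 0, 3, 1)))
        for _ in range(pgos_threads)]
    for w in workers:
        w.start()

    # The calling thread translates IRI text to IDs and feeds both queues.
    for quad in quads:  # each quad is (graph, subject, predicate, object)
        row = tuple(iri_ids.setdefault(t, len(iri_ids) + 1) for t in quad)
        gspo_q.put(row)
        pgos_q.put(row)

    gspo_q.put(STOP)
    for _ in range(pgos_threads):
        pgos_q.put(STOP)
    for w in workers:
        w.join()
    # PGOS rows arrive in arbitrary order; sort for a stable result.
    return iri_ids, gspo, sorted(pgos)


ids, gspo_rows, pgos_rows = bulk_load(
    [("g1", "s1", "p1", "o1"), ("g1", "s2", "p1", "o2")])
```

With a single GSPO worker the rows reach that index in arrival order, while the PGOS workers interleave freely, which mirrors the ascending versus random insert patterns described above.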
It turns out that the new async queue plus thread pool construct is very handy for any pipeline or symmetric parallelization. When this is well tested, I will update the documentation and maybe write a technical article about it.
Transactionality is not an issue in the bulk load situation. The graph being loaded will be incomplete until loading finishes anyway, other graphs will not be affected, and the bulk loader threads will not hold a significant number of locks at any time.
Also later, when looking at within-query and other parallelization, we have many interesting possibilities. For example, we may measure the CPU and IO load and adjust the size of the shareable thread pool accordingly. All SQL or web requests get their thread just as they now do, and extra threads may be made available for opportunistic parallelization up until we have full CPU and IO utilization. Still, this will not lead to long queries preempting short ones, since all get at least one thread. I may post some results of parallel RDF loading later on this blog.
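To make the pool-sizing idea concrete, here is one possible heuristic, sketched in Python. It is purely hypothetical: the function name `extra_threads`, its parameters, and the choice to size by the headroom of the busiest resource are all assumptions for illustration, not a description of what Virtuoso does or will do.

```python
def extra_threads(cpu_util, io_util, n_cpus, max_extra):
    # cpu_util and io_util are utilization fractions in [0, 1].
    # Assumption: the busier of the two resources limits how much
    # opportunistic parallelism is worth adding.
    headroom = 1.0 - max(cpu_util, io_util)
    if headroom <= 0:
        return 0
    # Scale the headroom by CPU count, capped by the shareable pool size.
    return min(max_extra, int(headroom * n_cpus))


half_loaded = extra_threads(0.5, 0.25, n_cpus=8, max_extra=16)
saturated = extra_threads(1.0, 0.2, n_cpus=8, max_extra=16)
```

Since every request keeps its own thread regardless of this number, a result of zero only withholds the extra threads; it never starves a short query.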
About this entry:
Author: Orri Erling
Published: 07/13/2006 10:35 GMT
04/16/2008 16:13 GMT