We have now arrived at the RDF-specific parts of the Virtuoso cluster effort.
These concern special tricks for mapping IRIs and object values to and from their internal IDs. A quad refers to its graph, subject, predicate, and object by internal ID. The object, if short enough, can be inlined, so that in most cases no join is needed to get its external form.
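As a rough sketch of the idea, the object slot of a quad either holds a short value directly or an ID pointing into the mapping table. The threshold and representation below are purely illustrative, not Virtuoso's actual storage format:

```python
# Hypothetical sketch of the inlining decision for a quad's object slot.
# INLINE_MAX and the tuple representation are illustrative assumptions.

INLINE_MAX = 20  # assumed maximum length for an inlined object value

def make_object_slot(value, id_of):
    """Return what is stored in the O column of a quad.

    Short strings and numbers are stored inline, so reading the quad
    yields the value with no extra join.  Longer values are replaced
    by an internal ID that must later be translated back via the
    ID-to-string mapping table (id_of stands in for that allocation)."""
    if isinstance(value, (int, float)):
        return ('inline', value)
    if isinstance(value, str) and len(value) <= INLINE_MAX:
        return ('inline', value)
    return ('id', id_of(value))
```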
Now this is old stuff.
Clustering does not change this, except that now the tables for the mappings between IDs and their external form are partitioned. So more often than not, getting the string for the ID involves an RPC. The most recently used mappings are of course cached outside of the table but still, having a network round trip each time an IRI is returned for the first time in a while is no good.
The solution is the same as always: one round trip per partition, with all the IDs concerned in a single message. The same applies to reading documents, where strings are translated to IDs and new IDs are made; only now we have a distributed read-write transaction.
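The batching pattern can be sketched as follows. This is a minimal model, assuming a partition_of hash function, a per-node fetch_strings RPC, and a cache dictionary, all of which are stand-ins rather than Virtuoso internals:

```python
# Sketch: turn N single-ID lookups into at most one message per partition.
from collections import defaultdict

def resolve_ids(ids, partition_of, fetch_strings, cache):
    """Translate internal IDs to their external strings with at most
    one round trip per partition instead of one round trip per ID."""
    result = {i: cache[i] for i in ids if i in cache}
    missing = [i for i in ids if i not in cache]
    # group the uncached IDs by the partition that stores their mapping
    by_partition = defaultdict(list)
    for i in missing:
        by_partition[partition_of(i)].append(i)
    for part, batch in by_partition.items():
        strings = fetch_strings(part, batch)  # one RPC carries the whole batch
        for i, s in zip(batch, strings):
            cache[i] = s
            result[i] = s
    return result
```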
In terms of programming model, we have a general-purpose partitioned pipe. One can think of it as a single control structure combining map and reduce.
Now consider returning results for a text search:
SELECT doc_id, summary (doc_id, 'search pattern')
  FROM (SELECT TOP 10 doc_id
          FROM docs
         WHERE CONTAINS (text, 'search pattern', score)
         ORDER BY relevance (doc_id, score, 'search pattern') DESC
       ) f;
The summary and relevance are each a function of the doc_id and search pattern. The relevance function would typically access some precomputed site rank and combine this with a hit score that is a function of the word frequencies in the individual hit.
Now we are not specifically in the business of text search but this well known example will serve to make a more general point.
How does a query like this work when everything is partitioned?
The text index (an inverted file from word to positions in each document) can be split by word, by document ID, or both. In either case it will produce the hits fairly quickly, partitioned or not. The hard part is sorting the possibly millions of hits according to criteria that involve joining with document or site metadata. To produce fast response, this loop must run in parallel, but since each score is independent, this is no problem. Finally, summaries have to be made only for the items actually returned, and again each summary is independent.
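The pattern of the inner loop can be sketched like this: score every hit in parallel, keep only the top k, and build summaries for those alone. relevance and summary here stand in for the user functions of the example query, and a thread pool is a rough stand-in for parallel, partition-local execution:

```python
# Sketch: parallel independent scoring, top-k selection, summaries
# computed only for the winners.
import heapq
from concurrent.futures import ThreadPoolExecutor

def top_hits(doc_ids, relevance, summary, k=10):
    with ThreadPoolExecutor() as pool:
        # each relevance call is independent, so they run in parallel
        scored = pool.map(lambda d: (relevance(d), d), doc_ids)
        best = heapq.nlargest(k, scored)          # keep only the top k hits
        return [(d, summary(d)) for _, d in best]  # k summaries, not millions
```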
A text search engine does not depend on a general purpose query processor for scheduling things in this way.
The case is different with RDF where we must do basically the same things with ad-hoc queries. As it happens, the above sample query will run as described if only the summary and relevance functions are properly declared.
So we have a special declaration for a partitioned function call. Further, the partitioned function call (in this case summary and relevance) will dispatch according to a given index, thus going to run on a node hosting the actual data. This is like the map part of map-reduce. But this is not all. A function can return either a final value or a next step. The next step is another partitioned function that gets the output of the previous one as its input and may use the same or a different partitioning key; this can be regarded as a second map or, in some cases, a reduce step.
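A toy model of this dispatch-with-continuation loop might look as follows. The tagged-tuple convention and the run_partitioned driver are illustrative inventions; the point is only the shape of the control flow, where a call either finishes or hands a follow-up call to a possibly different partition:

```python
# Toy model of partitioned calls with a "next step": a function returns
# either ('done', value) or ('next', func2, key2, arg2); the latter is
# routed again by partitioning key, like a second map or a reduce step.
from collections import defaultdict

def run_partitioned(calls, partition_of):
    """calls: list of (func, key, arg) triples."""
    results = []
    while calls:
        # group calls by the partition owning their key, one batch per node
        batches = defaultdict(list)
        for func, key, arg in calls:
            batches[partition_of(key)].append((func, arg))
        pending = []
        for part, batch in batches.items():
            for func, arg in batch:  # would execute remotely on node `part`
                outcome = func(arg)
                if outcome[0] == 'done':
                    results.append(outcome[1])
                else:
                    pending.append(outcome[1:])  # (func2, key2, arg2)
        calls = pending
    return results
```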
Now the functions can be large or small, down to a single index lookup, where the RPC latency is an order of magnitude greater than the time to perform the function itself. The partitioned pipe manages batching these together and overlapping processing on all nodes. Plus, the work can be transactional, all bound in a single distributed transaction, or each task can run on its own with error retries and so forth, as in the usual map-reduce situation where relatively large tasks are sent around.
At present, we have all this implemented and we are running tests with large RDF data sets on clustered Virtuoso.