We showed our billion triples demo at the ISWC 2008 poster session. Generally, people liked what they saw, as we basically did what one had always wanted to do with SPARQL but never could. This means, first, full SQL parity, with sub-queries, aggregation, full text, etc. Beyond SQL, we have transitive sub-queries, owl:sameAs at run time, and other inferencing, all on demand.
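To make the SQL-parity claim concrete, here is the kind of query we have in mind. This is only a sketch: the ex: vocabulary is made up, bif:contains is Virtuoso's full-text predicate, and the sub-query and aggregate syntax is written in a generic style that may differ in detail from the exact extension syntax the endpoint accepts.

```sparql
# Illustrative only: count papers per author among papers whose abstract
# mentions "ontology" -- full text, a sub-query, and aggregation in one
# SPARQL statement.  ex:Paper, ex:author, ex:abstract are made-up names.
PREFIX ex: <http://example.org/schema/>

SELECT ?author (COUNT(?paper) AS ?papers)
WHERE {
  {
    SELECT DISTINCT ?paper ?author
    WHERE {
      ?paper a ex:Paper ;
             ex:author ?author ;
             ex:abstract ?txt .
      ?txt bif:contains "ontology" .   # Virtuoso built-in full-text match
    }
  }
}
GROUP BY ?author
ORDER BY DESC(?papers)
LIMIT 10
```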

The live demo is at http://b3s.openlinksw.com/. This site is under development and may not be up all the time. We are taking it in the direction of hosting the whole LOD cloud. This is an evolving operation in which we will keep showcasing how one can ask increasingly interesting questions of a growing online database, in the spirit of the billion triples charter.

In the words of Jim Hendler, we were not selected for the finale because that would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this: if the event becomes like the TPC benchmarks, entry will be limited to full-time database players. Anyway, we got a special mention in the intro of the challenge track.

The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the DARQ Matter of Federation post, we have a notion of how to do this at production strength with our cluster engine, now also over wide-area networks. We shall see.

Why Not Just Join?

The entries from DERI and LarKC (MaRVIN, "Massive RDF Versatile Inference Network") were doing materialization of inference results in a cluster environment. What they were not doing was joining across partitions. The data was partitioned on some criterion, and the data in each partition was then further refined according to rules known to all partitions. DERI's entry did not take joining any further.
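To make "refined according to rules known to all partitions" concrete, here is one way to read it, written as a CONSTRUCT. The rule is just the standard RDFS type-inheritance rule, meant as an illustration rather than a description of either system's actual rule set.

```sparql
# Illustrative: RDFS type inheritance as a CONSTRUCT rule.  If the
# rdfs:subClassOf triples (the schema) are replicated to every partition,
# the rule can be applied to each partition's instance triples in
# isolation -- materialization without a cross-partition join.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT { ?x a ?super }
WHERE {
  ?x a ?class .                    # instance triple, local to the partition
  ?class rdfs:subClassOf ?super .  # schema triple, known to all partitions
}
```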

"Nature shall be the guide of the alchemist," goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for information is expensive and subject to misunderstanding; asking another research group might never produce an answer.

Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat-latency point-to-point connectivity: some joins are fast; others are not even tried, for all we know.

A database running on a cluster is a sort of counter-example: with an RDF workload, it will end up joining across partitions pretty much all of the time.
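To see why, a minimal two-pattern query is enough. The properties below are made up, and placement by subject hash is just an assumption for the sake of the example.

```sparql
# Hypothetical: what is the population of the place each person is based
# near?  If triples are placed by hashing the subject, the ?place bindings
# from the first pattern are scattered over the cluster, and resolving the
# second pattern means shipping each binding to the node holding that
# subject -- a cross-partition join even for this trivial query.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/schema/>

SELECT ?person ?pop
WHERE {
  ?person foaf:based_near ?place .
  ?place  ex:population   ?pop .
}
```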

MaRVIN's approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things.

The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes.

Another practical consideration is that RDF reasoners are not usually built for distributed-memory multiprocessors. If the reasoner must be a plug-in component, it cannot be expected to be written for grids.

We can think of a product safety use case: find cosmetics that contain ingredients considered toxic in the amounts in which they are present in each product. This can be done as a database query with some transitive operations, such as running through a cosmetics taxonomy and a poisons database. If the business logic that decides whether the presence of an ingredient in a product is a health hazard is very complex, we can get a lot of joins.
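As a sketch, with made-up vocabulary throughout: the taxonomy walk is written here as a property path, which corresponds roughly to a transitive sub-query in our extended syntax, and the single FILTER stands in for business logic that could be arbitrarily more involved.

```sparql
# Made-up vocabulary (ex:, tox:).  Find products whose ingredient amounts
# exceed a toxicity threshold, walking the cosmetics taxonomy and joining
# against a poisons database.
PREFIX ex:  <http://example.org/cosmetics/>
PREFIX tox: <http://example.org/toxins/>

SELECT ?product ?ingredient ?amount ?limit
WHERE {
  # transitive walk up the product taxonomy
  ?product ex:category/ex:subCategoryOf* ex:Cosmetics .
  # each ingredient occurrence records the amount used in this product
  ?product ex:hasIngredientUse ?use .
  ?use     ex:ingredient ?ingredient ;
           ex:amount     ?amount .
  # join to the poisons database
  ?ingredient tox:sameSubstanceAs ?substance .
  ?substance  tox:toxicAbove      ?limit .
  FILTER (?amount > ?limit)   # placeholder for the real health-hazard logic
}
```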

The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then to see whether matches are made. The matching logic could be arbitrarily complex, since it would run locally. Of course, some domain knowledge is needed here to set up the processing so that each product and poison carries all of its associated information with it. Dancing with half a partner can bias one's perceptions: again, as in nature, not all the cards are always on the table.

It would seem that there is some setup involved before a question can be answered: composition of partitions, frequency of result exchange, and so forth. How critical the domain knowledge implicit in this setup is to the quality of the results is an interesting question.

The question is, at what point does a cluster using distributed database operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput, and how close to linear the increase is depends on whether all intermediate results pass through one coordinating node (worst case) or whether each node can decide for itself which node will perform the next join step for each result (best case). For example, a DISTINCT operation requires that data pass through a single place, but joining and aggregation in general do not.
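The DISTINCT-versus-aggregation point can be seen in two small queries over a made-up property; this is only meant to illustrate the data-flow difference, not any particular execution plan of ours.

```sparql
# (1) DISTINCT: every candidate binding must reach one place to be
#     deduplicated, so intermediate results funnel through a single node.
SELECT DISTINCT ?author
WHERE { ?paper <http://example.org/schema/author> ?author }

# (2) Aggregation: each node can count over its own partition and the
#     partial counts can then be merged, so no single funnel is required.
SELECT ?author (COUNT(?paper) AS ?papers)
WHERE { ?paper <http://example.org/schema/author> ?author }
GROUP BY ?author
```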

We will still publish numbers this November.