I was recently in Boston for the Mapping Relational Data to RDF workshop of the W3C.

The common feeling was that mapping everything to RDF and querying it in terms of a generic domain ontology, mapped on demand onto whatever line-of-business systems hold the data, would be very good if only it could be done. However, since this is not so easily done, the next best thing is to extract the data and then warehouse it as RDF.

The obstacles perceived were of the following types:

  • Lack of quality in the data. The different line-of-business systems do not, in and of themselves, hold enough semantics. If the meaning of data columns in relational tables were really known and explicit, those columns could be meaningfully used for joining across systems. But this is more complex than just mapping the metal lead to the chemical symbol Pb and back.

  • Lack of performance in RDF storage. Data sets even in the tens of millions of triples do not run very well in some stores. Well, we had the Banff life sciences demo with 450M triples in a small server box running Virtuoso, so this is not universal; and of course we are coming up with a whole different order of magnitude, as often discussed on this blog.

  • Lack of functionality in mapping, and possibly failure to push enough of the query processing down to the underlying data stores; see the sketch just below this list.
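To make the pushdown point concrete, here is a minimal sketch, using a made-up ex: vocabulary and an assumed customer table rather than any particular product's mapping language. The point is that a mapper with adequate pushdown should translate the whole pattern into one SQL statement against the source system instead of materializing triples first and filtering them afterwards.

    # Purely illustrative: a hypothetical customer table exposed on demand
    # through a made-up ex: vocabulary. A mapper with adequate pushdown
    # should turn this whole pattern into a single SQL statement against
    # the source system, roughly
    #   SELECT c_name, c_city FROM customer WHERE c_city = 'Boston'
    # rather than fetching triples and filtering them above the source.
    PREFIX ex: <http://example.org/crm#>

    SELECT ?name ?city
    WHERE
      {
        ?customer  a        ex:Customer ;
                   ex:name  ?name ;
                   ex:city  ?city .
        FILTER ( ?city = "Boston" )
      }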

Personally, I am quite aware of what to do with regard to performance of mapping and storage, and see these as eminently solvable issues. After all, we have a great investment of talent in databases in general, and it can be well deployed towards RDF, as we have been doing these past couple of years. So we talk about the promise of a 360-degree view of information, with RDF as the top layer. Everybody agrees that this is a nice concept. But the concept is especially nice when it can do the things that are the most common baseline expectation of any regular DBMS, i.e., aggregation, grouping, sub-queries, VIEWs. Now, I would not go sell a DBMS that has no COUNT operator to a data warehousing shop.
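To make that baseline concrete, here is a minimal sketch of the kind of query I mean, written in SPARQL extended with aggregates and grouping; such constructs are not in the specification as it stands, and the ex: vocabulary is made up for illustration.

    # Orders per customer, largest first: the kind of everyday question any
    # warehouse user expects a DBMS to answer. COUNT and GROUP BY are
    # extensions here, not part of SPARQL as currently specified.
    PREFIX ex: <http://example.org/sales#>

    SELECT ?customer ( COUNT(?order) AS ?order_count )
    WHERE
      {
        ?order  a            ex:Order ;
                ex:customer  ?customer .
      }
    GROUP BY ?customer
    ORDER BY DESC(?order_count)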

The fact that OpenLink and Oracle allow RDF inside SQL, and OpenLink even adds native aggregates and grouping to SPARQL, fixes the problem with regard to specific products, but leaves the standardization issue open. Of course, any vendor will solve these questions one way or another because a database with no aggregation is a non-starter.

I talked to Lee Feigenbaum, chair of the W3C DAWG, about the question of aggregates and general BI capabilities in SPARQL. He told me that, prior to his time with the DAWG, these were left out because they conflicted with the open-world assumption around RDF: You cannot count a set because by definition you do not know that you have all the members, the world being open and all that.

Say what? Talk about the road to hell being paved with good intentions. Now, this is in no way Lee's or the present-day DAWG's fault; as a member myself, I can attest to the good work and would under no circumstances wish any delays or revisions to SPARQL at this point. I am just pointing out a matter that all implementations should address, as a sort of precondition of entry into the real-world IS space. If this can be done interoperably, so much the better.

Now, out of the deliberations at the Boston workshop arose at least two ideas for follow-up activity.

The first was an incubator group for RDF store and mapping benchmarking. This is very appropriate, in order to dispel the bad name that RDF storage and querying performance has been saddled with. As a first step in this direction, I will outline a social-web-oriented benchmark on this blog.

The second activity was an incubator group for preparing standardization of mapping methodologies from relational schemas to RDF. We will be active on this as well.

The two offshoots appear logically separate but are not necessarily so in practice. A benchmark is, after all, something that is supposed to promote a technology to a user base. The user base seems to wish to put all online systems and data warehouses under a common top-level RDF model and then query away, introducing no further replication of data, performance cost, or ETL latencies.

Updating would also be nice, but even query-only would be very good. Personally, I'd say the RDF strength is all on the query side. Transactions are taken care of well enough by what already exists; RDF stands out in integration and on the ad-hoc and discovery side of the matter. Given this, we expect the value to be consumed in a heterogeneous, multi-database, federated environment. Thus a benchmark should measure this aspect of the use case. With the right mapping and queries, we could probably demonstrate the added cost of RDF to be very low, as long as we could push every query that can be answered by a single source down to the responsible DBMS. For distributed joins, we are back at the question of optimizing distributed queries, but this is a familiar one, and RDF is not the principal cost factor.
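As a sketch of what such a federated query might look like, assume two graphs standing for two mapped line-of-business systems; the graph names and vocabularies are invented for illustration.

    # Each GRAPH block can be answered entirely by its own source DBMS;
    # only the join on ?customer_id has to happen above the sources.
    PREFIX crm: <http://example.org/crm#>
    PREFIX erp: <http://example.org/erp#>

    SELECT ?name ?invoice ?amount
    WHERE
      {
        GRAPH <http://example.org/systems/crm>
          {
            ?customer  crm:name  ?name ;
                       crm:id    ?customer_id .
          }
        GRAPH <http://example.org/systems/erp>
          {
            ?invoice  erp:customer_id  ?customer_id ;
                      erp:amount       ?amount .
          }
      }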

The subject does become quite complex at this point. We would have to take supposedly representative synthetic OLTP and BI data sets (like the ones in TPC-D, TPC-E, and TPC-H), and invent queries across them that would both make sense and be implementable in SPARQL extended with aggregates and sub-queries. Reliance on SPARQL extensions is simply unavoidable. Setting up the test systems would be non-trivial, even though there is a lot of industry experience in these matters on the database side.
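For flavor, here is a sketch of the kind of query I have in mind, restricted to a TPC-H-flavored subset (revenue per customer, keeping only the largest) and written in SPARQL extended with aggregates and a sub-query; the tpch: vocabulary is an assumed, hand-made mapping of the familiar lineitem and orders tables, purely for illustration.

    # Revenue per customer from a hypothetical triple rendering of the
    # TPC-H lineitem and orders tables. The sub-select, SUM, and GROUP BY
    # are extensions beyond SPARQL as currently specified.
    PREFIX tpch: <http://example.org/tpch#>

    SELECT ?cust ?revenue
    WHERE
      {
        {
          SELECT ?cust ( SUM(?price * (1 - ?discount)) AS ?revenue )
          WHERE
            {
              ?li     tpch:order          ?order ;
                      tpch:extendedprice  ?price ;
                      tpch:discount       ?discount .
              ?order  tpch:customer       ?cust .
            }
          GROUP BY ?cust
        }
        FILTER ( ?revenue > 1000000 )
      }
    ORDER BY DESC(?revenue)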

So, while this is probably the benchmark most relevant to the target audience, we may have to start with a simpler one. I will outline something to that effect next.