The W3C has recently launched an incubator group on mapping relational data to RDF.

From participating in the group's first few sessions, I have come away with the following impressions.

There is a segment of users, for example from the biomedical community, who do heavy-duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity are points of interest.

There is another segment concerned with semantifying the document web, a topic that includes initiatives such as Triplify and semantic web search engines such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. Whoever comes next will clean up the semantics, if they need cleaning up at all.

(Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.)

Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.

As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.

Where do we stand?

We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing.

We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.

These things are somewhat different.

At present, both the research warehousers and the web scalers predominantly go for ETL.

This is fine by us as we definitely are in the large RDF store race.

Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or an application-specific compressed columnar layout. Thus, with a mapping in place, there is nothing to block us from running analytics in SPARQL, once the obviously necessary extensions for sub-queries, expressions, and aggregation are there.
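
To make this concrete, here is a minimal sketch of the kind of analytics query I mean, written with the sort of aggregate and expression syntax such extensions would provide; the ex: vocabulary and property names are invented for illustration.

    # Hypothetical analytics query: total and average order value per customer,
    # keeping only customers with more than 100 orders. The ex: vocabulary is
    # invented; the point is the need for expressions, aggregates, and grouping.
    PREFIX ex: <http://example.org/schema#>

    SELECT ?customer (SUM(?amount) AS ?total) (AVG(?amount) AS ?avg)
    WHERE
      {
        ?order  ex:customer  ?customer ;
                ex:amount    ?amount .
      }
    GROUP BY ?customer
    HAVING (COUNT(?order) > 100)
    ORDER BY DESC(?total)

Run against a mapped relational store with the right indices, a query of this shape is exactly the sort of workload where the relational layout pays off.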

To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry for pretty much each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, triplified this would come to 1 petatriple (1e15 triples). "Such a big server farm that we'd have to put it on our map," as Jenny Harding put it at ESWC 2008. I'd add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.

So, in some cases, on-the-fly RDBMS-to-RDF mapping is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing the maximum of the processing to the RDBMS(s) involved in the query.
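
To illustrate what "smart enough" means, here is a sketch with an invented ex: vocabulary and invented table names: a naive mapper would look up each triple pattern separately, whereas a smart one notices that all the patterns come from two joinable tables and pushes a single SQL join down to the RDBMS.

    # Sketch: a pattern over relationally mapped data; the ex: vocabulary and
    # the table and column names in the comment below are hypothetical.
    PREFIX ex: <http://example.org/schema#>

    SELECT ?name ?date ?total
    WHERE
      {
        ?order  ex:customer   ?cust ;
                ex:orderDate  ?date ;
                ex:total      ?total .
        ?cust   ex:name       ?name .
        FILTER (?total > 1000)
      }

    # What a smart mapper should push down: roughly one SQL join, not one
    # lookup per triple pattern, e.g.
    #
    #   SELECT c.c_name, o.o_date, o.o_total
    #     FROM orders o JOIN customers c ON o.o_custkey = c.c_custkey
    #    WHERE o.o_total > 1000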

You can download the slide set I used for the Virtuoso presentation to the RDB to RDF mapping incubator group (PPT; other formats coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as a real understanding of the databases and subject matter from the information architect. Sometimes in the web space it can suffice to put data out there with a trivial RDF translation and hope that a search engine or the like will figure out how to join it with something else. In the enterprise, things are not so. The benefits are clear if one can navigate between disjoint silos, but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble yet non-trivial problem.

We will show the basics of this with the TPC-H mapping, and by joining this with physical triples. We will also make a second set of tables in TPC-H format, map keys in the one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper by us on the subject, Business Intelligence Extensions for SPARQL (PDF; other formats coming soon), in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing SQL that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better with an IN sub-query than with the equivalent existence test.
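
As a sketch of the kind of cross-set join meant here, consider two TPC-H data sets mapped under hypothetical tpch1: and tpch2: vocabularies, with an invented ex:sameCustomerAs link between their customer keys; the sub-selects are exactly the sort of SPARQL extension mentioned above.

    # Hypothetical join across two separately mapped TPC-H instances.
    # All prefixes and property names are invented for illustration.
    PREFIX tpch1: <http://example.org/tpch1#>
    PREFIX tpch2: <http://example.org/tpch2#>
    PREFIX ex:    <http://example.org/mapping#>

    SELECT ?cust ?spend1 ?spend2
    WHERE
      {
        # Aggregate each data set separately before the join, so that orders
        # are not double-counted across the two sets.
        { SELECT ?cust (SUM(?price1) AS ?spend1)
          WHERE { ?o1 tpch1:hasCustomer ?cust ; tpch1:totalPrice ?price1 . }
          GROUP BY ?cust }

        ?cust ex:sameCustomerAs ?cust2 .

        { SELECT ?cust2 (SUM(?price2) AS ?spend2)
          WHERE { ?o2 tpch2:hasCustomer ?cust2 ; tpch2:totalPrice ?price2 . }
          GROUP BY ?cust2 }
      }

With mapping in place, each sub-select should translate into an aggregating SQL query pushed to the respective table set, with only the final join on the key mapping done above.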

Follow this blog for more on the topic; published papers are always a limited view on the matter.