Many of you will know about the W3C relational-to-RDF mapping incubator group (RDB2RDF XG). The group is planning to recommend chartering a working group to draw up a specification for relational-to-RDF mapping.

To this end, I recently summarized the group's discussions and some of our own experiences around the topic at <http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling>.

Here I will discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis-à-vis the following points, even though any working group will unavoidably run into these issues along the way. A blog post, on the other hand, can be more specific.

I gave a talk to the RDB2RDF XG this spring, with these slides.

The main point is that people would really like to map on the fly, if only they could. Making an RDF warehouse is not of value in itself, though it is true that in some cases it cannot be avoided.

At first sight, one would think that a mapping specification could be neutral as to whether one stores the mapped triples as triples or makes them on demand. In practice, however, there is almost no comparison between the complexity of doing non-trivial mappings on the fly and that of mapping as ETL. Some of this complexity spills over into the requirements for a mapping language.

Eliminating JOINs

We expect a situation where one virtual triple can have many possible sources: the mapping is a union of mapped databases, and any integration scenario will have this feature. In such a situation, if we JOIN via such triples, we end up with UNIONs over all databases that could produce the triples in question. This is generally not desired. Therefore, in the on-demand mapping case, there must be a lot of type-inference logic that is simply not relevant in the ETL scenario.

To make the point clearer, suppose a query like "list the organizations whose representatives have published about XX." Suppose that three databases are mapped, all of which have a table of organizations, a table of persons with affiliations to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the organizations whose people have published articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author's affiliation in the same database. However, the RDB-to-RDF mapping does not necessarily know this, if all it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another: a paper in database X will usually not have an author in database Y. The IDs in database Y, even if perchance equal to the IDs in X, do not mean the same thing, and there is no point joining across databases by them.
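As a rough sketch of what is at stake, consider the single-database translation versus the blow-up; all table, column, and IRI names here are hypothetical, and the SPARQL is abbreviated:

```sql
-- SPARQL shape (prefixes omitted; names are invented):
--   SELECT ?org WHERE {
--     ?pub    ex:tag         "XX" .
--     ?pub    ex:author      ?person .
--     ?person ex:affiliation ?org . }
--
-- Within a single database, the natural translation is a plain join:
SELECT 'http://x.example.org/org/' || CAST(p.org_id AS VARCHAR(20)) AS org
  FROM tag t
  JOIN publication pub ON pub.id = t.pub_id
  JOIN person p ON p.id = pub.author_id
 WHERE t.tag = 'XX';
-- With three databases mapped and no disjointness knowledge, each of
-- the three triple patterns expands to a 3-way UNION, yielding
-- 3 x 3 x 3 = 27 join combinations, of which only the 3 same-database
-- ones can ever produce rows.
```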

This entire question is a non-issue in the ETL scenario, but it is absolutely vital in real-time mapping. It is also something that must be stated, at least implicitly, in any mapping: if a mapping translates keys from one source to IRIs with one pattern, and keys from another source using another pattern, it must be inferable from the patterns whether the resulting sets of IRIs are disjoint.

This is critical. Otherwise we will be joining everything to everything else, with an orders-of-magnitude penalty compared to hand-crafted SQL over the same data sources.
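To illustrate the kind of reasoning involved, here is a hedged sketch of a join branch that disjoint IRI patterns make provably empty; the schemas and patterns are invented for the example:

```sql
-- A cross-database join branch: publications from database X against
-- persons from database Y, each mapping declaring its own IRI pattern.
SELECT *
  FROM x.publication xp, y.person yp
 WHERE 'http://x.example.org/pub/' || CAST(xp.id AS VARCHAR(20))
     = 'http://y.example.org/person/' || CAST(yp.id AS VARCHAR(20));
-- The two sides have different constant prefixes, so the equality can
-- never hold: the branch is provably empty and can be dropped at
-- compile time without touching either database.
```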

Expectations and Limitations on Queries

SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates appear as constants, not variables, in the query.
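A minimal sketch of this easy case, with hypothetical table and class names:

```sql
-- SPARQL: SELECT ?name WHERE { ?p a ex:Person ; ex:name ?name }
-- With ex:Person mapped to the person table and ex:name to its name
-- column, the translation is a single table scan:
SELECT person.name
  FROM person;
-- No UNIONs and no run-time dispatch; the mapping resolves everything
-- at query compile time.
```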

Virtuoso has SQL extensions for breaking a wide table into one row per column, which facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso's SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third-party SQL, no such tricks can be used.
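Leaving Virtuoso's actual extension syntax aside, the effect can be approximated in portable SQL as a UNION ALL unpivot; the table and column names below are hypothetical:

```sql
-- Turning wide rows into (subject, predicate, object) rows, one per
-- column, so that a variable in predicate position can range over the
-- pred column instead of forcing a separate plan branch per column:
SELECT id AS subj, 'ex:name'  AS pred, CAST(name  AS VARCHAR(100)) AS obj FROM person
UNION ALL
SELECT id AS subj, 'ex:email' AS pred, CAST(email AS VARCHAR(100)) AS obj FROM person
UNION ALL
SELECT id AS subj, 'ex:phone' AS pred, CAST(phone AS VARCHAR(100)) AS obj FROM person;
```

Over a third-party SQL end point, something of this shape would have to be emitted for every column of every mapped table, which is exactly the run-time cost the next paragraph worries about.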

The above suggests that to support on-the-fly mapping without relying on owning the SQL engine underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be constants. The alternative is prohibitive run-time cost and complexity.

But we must not throw out the baby with the bath-water. Aside from offering global identifiers, RDF's attractions include subclasses and subproperties. In relational terms, these translate to UNIONs and do involve some added cost. A mapping system simply has to have means of dealing with this cost, and of recognizing cases where it becomes prohibitive. Some further work is likely to be required to define well-behaved subsets of SPARQL and mappings.
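For instance, a query against a superclass compiles into a UNION over the tables of its subclasses; a sketch with hypothetical names:

```sql
-- SPARQL: SELECT ?x WHERE { ?x a ex:Agent }
-- If ex:Person and ex:Organization are both subclasses of ex:Agent and
-- each is mapped to its own table, the class pattern is a two-way UNION:
SELECT 'http://x.example.org/person/' || CAST(id AS VARCHAR(20)) AS x
  FROM person
UNION ALL
SELECT 'http://x.example.org/org/' || CAST(id AS VARCHAR(20)) AS x
  FROM organization;
-- The cost grows with the number of subclasses, which is why a mapper
-- must recognize when such a UNION becomes prohibitive.
```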

To ETL or Not to ETL?

Whether to warehouse or not? If one has hundreds of sources, some of which are not even relational, some ETL would seem necessary. Vipul Kashyap gave a position paper at last year's RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system combining relational mapping with on-demand RDF-izers for diverse semi-structured biomedical data, e.g., spreadsheets. The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores and another part comes from a less structured repository of ETLed physical triples.

Our take is that if something is a large or very large relational store, then map; else, ETL. With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations.

Conclusions

  • If you map on demand, watch out for an explosion of UNIONs when integrating sources that talk of similar things.
  • If you integrate lots of sources, some ETL is likely unavoidable. Look for ways of dealing with part ETL, part mapping. ETLing everything is not always best or even possible.
  • If you map a single fairly-clean RDB to RDF, mapping will work well, potentially much faster than triple storage, thanks to higher storage density and more data per index lookup on the relational side.
  • If you map on demand, some restrictions on SPARQL may be practically necessary. These have to do with variables in predicate position, variables in class position, etc. Individual implementations may support these, but standardization will likely have to put limits on them.

This was a quick summary, by no means comprehensive, of what an eventual RDB2RDF working group would come across. It is a sort of addendum to the requirements I outlined on the ESW wiki.