This is a revised version of the talk I will be giving at the Semdata workshop at VLDB 2010.

The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso.

A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.

The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMSs, of point solutions built with map-reduce or the like, and, in some cases, of key-value stores with relaxed schemas but limited querying.

The benefits of RDF are the ever-expanding volume of data published in it, the reuse of vocabularies, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability (you can run an RDF database on a cluster) as a question of relative cost compared with the alternatives.

The baseline is that for relational-style queries, one should get relational performance, or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column store, and so gets all the compression and locality advantages traditionally associated with one. Once memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.
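To make the reduction concrete, here is a minimal sketch in Python, illustrative only and not Virtuoso's actual engine: quads held as parallel columns, IRIs dictionary-encoded to integers, and the object column carrying a run-time type tag so that any value fits without a fixed schema. The columns themselves give the compression and locality; the run-time typing costs only a tag.

```python
# Illustrative sketch only: RDF quads in a run-time-typed column layout.
# Not Virtuoso's actual storage engine.

iri_ids = {}          # dictionary-encode IRIs to small integers
iris = []

def iri(s):
    """Intern an IRI and return its integer ID."""
    if s not in iri_ids:
        iri_ids[s] = len(iris)
        iris.append(s)
    return iri_ids[s]

# One list per column.  G, S, P hold IRI IDs; O is run-time typed, a
# (type_tag, value) pair, so any literal or IRI fits in a single column.
col_g, col_s, col_p, col_o = [], [], [], []

def insert_quad(g, s, p, o):
    col_g.append(iri(g)); col_s.append(iri(s)); col_p.append(iri(p))
    col_o.append(("iri", iri(o)) if o.startswith("http") else ("str", o))

def match(s=None, p=None):
    """Scan the columns for a triple pattern, as a column store would."""
    sid = iri(s) if s is not None else None
    pid = iri(p) if p is not None else None
    for i in range(len(col_s)):
        if (sid is None or col_s[i] == sid) and (pid is None or col_p[i] == pid):
            yield iris[col_s[i]], iris[col_p[i]], col_o[i]

insert_quad("http://g", "http://ex/s1", "http://ex/name", "Alice")
insert_quad("http://g", "http://ex/s1", "http://ex/knows", "http://ex/s2")
print(list(match(s="http://ex/s1")))
```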

I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.

But to realize the ultimate promise of RDF/Linked Data/Semdata, whatever we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer running so much against the RDBMS as against map-reduce and key-value stores.

Given the psychology of geekdom, the charm of map-reduce is understandable: one controls what is going on, can work in the usual languages, and can run on big iron without being picked to pieces by the endless concurrency, timing, and order-of-events issues one gets when programming a cluster by hand. That is tough for the best, and unworkable for the rest.

The key-value store has some of the same appeal: it is the DBMS laid bare, so to speak, made understandable, without, again, the intractably complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind complex queries; one can always hard-code a point solution for the business question if one really must, maybe even in map-reduce.
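To make the contrast concrete, here is a hypothetical point solution (the key layout and field names are invented): total revenue per region computed by a hand-coded scan of a key-value store, which in SQL would be a one-line GROUP BY.

```python
# Hypothetical point solution over a key-value store: total revenue per
# region, hand-coded as a full scan.  In SQL this is a one-line GROUP BY.
from collections import defaultdict

# Toy key-value store: key "order:<id>", value a dict (invented layout).
kv = {
    "order:1": {"region": "EU", "revenue": 120.0},
    "order:2": {"region": "US", "revenue": 80.0},
    "order:3": {"region": "EU", "revenue": 40.0},
}

totals = defaultdict(float)
for key, row in kv.items():          # full scan: no planner, no indexes
    if key.startswith("order:"):     # "schema" enforced by convention only
        totals[row["region"]] += row["revenue"]

print(dict(totals))                  # {'EU': 160.0, 'US': 80.0}
```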

Besides, for some things that go beyond SQL (for example, with graph structures), there really isn't a good solution.
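Reachability over a graph of unknown depth is the classic case: it needs iteration to a fixpoint, which the SQL of the day expresses awkwardly at best. A small sketch:

```python
# Reachability (transitive closure) over a graph: trivial in a loop,
# awkward in plain SQL because the join depth is not known in advance.

edges = {("a", "b"), ("b", "c"), ("c", "d")}

def reachable(src):
    seen, frontier = set(), {src}
    while frontier:
        # Semi-naive step: only expand nodes discovered in the last round.
        frontier = {t for (s, t) in edges if s in frontier} - seen
        seen |= frontier
    return seen

print(reachable("a"))  # {'b', 'c', 'd'}
```

The loop is trivial in code and in Datalog-style rules, which is exactly the thread picked up below.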

Now, enter Vertica, Greenplum, VectorWise (a commercialization by Ingres of the MonetDB/X100 project), and Virtuoso, and maybe others, all of which propose some combination of SQL and explicit map-reduce-style control structures. This is nice, but better is possible.

Here we find the next frontier of Semdata. Take Joe Hellerstein et al.'s work on declarative logic for the data-centric data center.

We have heard it many times: when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative queries, well-defined semantics, and parallel-database capability of the leading RDF stores. Merge this with the locality, compression, and throughput of the best analytics DBMSs.
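As a cartoon of what such rules buy (BOOM and BLOOM have their own declarative languages; this Python sketch only mimics the flavor): the transitive-path rule path(X, Z) <- edge(X, Y), path(Y, Z), evaluated semi-naively over edge facts partitioned across sites, with only newly derived facts crossing between them.

```python
# Cartoon of location-conscious, declarative rules in the spirit of
# BOOM/BLOOM (which use their own rule languages, not Python):
#   path(X, Y) <- edge(X, Y)
#   path(X, Z) <- edge(X, Y), path(Y, Z)
# Edge facts live on different "sites"; each site joins its local edges
# against the current delta, and only new facts travel between sites.

partitions = {  # invented placement of edge facts across two sites
    "site0": {("a", "b"), ("c", "d")},
    "site1": {("b", "c")},
}

path = set()
delta = {e for edges in partitions.values() for e in edges}  # path <- edge

while delta:  # semi-naive fixpoint, as a Datalog engine would run it
    path |= delta
    new = set()
    for site, edges in partitions.items():
        # Local join at `site`: its edge(X, Y) facts meet shipped path(Y, Z).
        new |= {(x, z) for (x, y) in edges for (y2, z) in delta if y == y2}
    delta = new - path  # only genuinely new facts cross the network

print(sorted(path))  # the full transitive closure of the edge relation
```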

Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.

Further, since RDF more or less reduces to relational columns, the techniques of caching, reuse of intermediate results, materialized joins, and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation.
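Demand-driven indexing is worth one sketch of its own. Database cracking, pioneered in MonetDB, reorganizes a column a little on every query: each range predicate partitions the piece it touches, so the column becomes index-like exactly where the workload looks. A simplified sketch over an in-memory integer column:

```python
# Simplified sketch of database cracking (demand-driven indexing, as in
# MonetDB): each range query partitions the column around its bounds, so
# the column self-organizes where the workload actually probes it.
import bisect

column = [57, 3, 92, 14, 41, 8, 73, 26, 60, 35]
cracks = []   # sorted crack values; between them lie unsorted pieces

def crack(value):
    """Reorder the column in place so everything < value comes first."""
    if value in cracks:
        return
    # Find the piece [lo, hi) that the crack value falls into.
    i = bisect.bisect_left(cracks, value)
    lo = 0 if i == 0 else sum(1 for x in column if x < cracks[i - 1])
    hi = len(column) if i == len(cracks) else sum(1 for x in column if x < cracks[i])
    piece = column[lo:hi]
    column[lo:hi] = [x for x in piece if x < value] + [x for x in piece if x >= value]
    bisect.insort(cracks, value)

def range_query(low, high):
    """Answer low <= x < high, cracking as a side effect of the scan."""
    crack(low)
    crack(high)
    return [x for x in column if low <= x < high]

print(range_query(20, 50))   # [41, 26, 35]; now contiguous in the column
print(column, cracks)        # column partially ordered around 20 and 50
```

The first query over a range pays for the partitioning; later queries over the same region find their values contiguous.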

Such a hybrid database-fusion frontier is relevant because it addresses heterogeneous, large-scale data, with operations that are not easy to reduce to SQL, still without losing the advantages of SQL. Apply this to anything from enhancing the business-intelligence process through faster integration, including integration with linked open data, to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.

In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.

Last week I was at the LOD2 kick-off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU-co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.