At the close of the year, I'll give a little recap of the past year of Virtuoso development and take a look at where we are headed in 2008.

A year ago, I was in the middle of redoing the Virtuoso database engine for better SMP performance. We redid the way traversal of index structures and cache buffers was serialized for SMP, and generally compared Virtuoso and Oracle engines function by function. We had just returned from the ISWC 2006 in Athens, Georgia, and the Virtuoso database was becoming a usable triple store.

Soon thereafter, we confirmed that all this worked when we put out the first cut of DBpedia with Chris Bizer et al., and were working with Alan Ruttenberg on what would become the Banff health care and life sciences demo.

The WWW 2007 conference in Banff, Canada, was a sort of kick-off for the Linking Open Data movement, which started as a community project under SWEO, the W3C interest group for Semantic Web Education and Outreach, and has gained a life of its own since.

Right after WWW 2007, the Virtuoso development effort split onto two tracks: one for enhancing the then-new 5.0 release, and one for building a new generation of Virtuoso, notably featuring clustering and double storage density for RDF.

The first track produced constant improvements to the relational-to-RDF mapping functionality, SPARQL enhancements, and Redland-, Jena-, and Sesame-compatible client libraries with Virtuoso as a triple store. These things have been out with testers for a while and are all generally available as of this writing.

The second track started with adding key compression to the storage engine, specifically with RDF in mind, though there are some gains in relational applications as well. With RDF, space consumption drops to about half, all without recourse to any compression, like gzip, that would break random access. Since the start of August we have turned to clustering and are now code complete, with pretty much all the tricks one would expect: full-function SQL, taking advantage of co-located joins, and doing aggregation and generally all possible processing where the data is. I have covered details of this along the way in previous posts. The key point is that the thing is now written and works with test cases.

In late October, we were at the W3C workshop on mapping relational data to RDF. For us, this confirmed the importance of mapping and of scalability in general. Ivan Herman proposed forming a W3C incubator group on benchmarking, and a W3C incubator group on relational-to-RDF mapping is also being formed.

Now, scalability has two sides: one is dealing with volume, and the other is dealing with complexity. Volume alone will not help if interesting queries cannot be formulated. Hence, we recently extended SPARQL with sub-queries so that we can now express at least any SQL workload, which was previously not the case. It is something of a contradiction in terms to say that SPARQL is the universal language for information integration while it cannot express, for example, the TPC-H queries. Well, we fixed this; a separate post will show how. The W3C process will eventually follow, as the necessity of these things is undeniable, on the unimpeachable authority of the whole SQL world. Anyway, for now, SPARQL as it is ought to become a recommendation, and extensions can be addressed later.
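To illustrate the kind of nesting this makes possible, here is a minimal sketch of a SPARQL query with an aggregating sub-query. The tpch: vocabulary, the threshold, and the exact surface syntax are only assumptions for the example, not a quote of the actual Virtuoso extension.

    PREFIX tpch: <http://example.org/tpch#>   # hypothetical vocabulary for the example

    SELECT ?customer ?totalSpent
    WHERE
      {
        # Sub-query: total spent per customer, computed before the outer filter.
        { SELECT ?customer (SUM (?price) AS ?totalSpent)
          WHERE
            {
              ?order  tpch:hasCustomer  ?customer .
              ?order  tpch:totalPrice   ?price .
            }
          GROUP BY ?customer
        }
        # Outer pattern: keep only the big spenders.
        FILTER (?totalSpent > 100000)
      }
    ORDER BY DESC (?totalSpent)
    LIMIT 10

Without the sub-query there is no place to compute the aggregate and then filter and sort on its result within a single query, which is exactly the shape most of the TPC-H workload takes.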

For now, the only RDF benchmark that seems to be out there is the loading part of the LUBM. We did a couple of enhancements of our own for that just recently, but much bigger things are on the way. Also, the billion triples challenge is an interesting initiative in the area. We all recognize that loading any number of triples is a finite problem with known solutions. The challenge is running interesting queries on large volumes.

Our present emphasis is demonstrating both RDF data warehousing and RDF mapping with complex queries and large data. We start with the TPC-H benchmark, running the queries both through mapping to SQL against any RDBMS (Oracle, DB2, Virtuoso, or other) and by querying a physical RDF rendition of the data in Virtuoso. From there, we move to querying a collection of RDBMSes hosting similar data.
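As a rough sketch of what such a query looks like, here is a revenue-per-nation query loosely in the spirit of TPC-H Q5, again with a made-up tpch: vocabulary. The point is that the same SPARQL text serves whether the data behind it comes from a relational-to-RDF mapping or from physical triples; only the mapping declaration, omitted here, differs between the two cases.

    PREFIX tpch: <http://example.org/tpch#>   # hypothetical vocabulary for the example

    # Revenue per nation; runs unchanged over mapped relational data
    # or over a physical RDF rendition of the same schema.
    SELECT ?nationName (SUM (?extPrice * (1 - ?discount)) AS ?revenue)
    WHERE
      {
        ?li      tpch:hasOrder        ?order .
        ?order   tpch:hasCustomer     ?cust .
        ?cust    tpch:memberOfNation  ?nation .
        ?nation  tpch:name            ?nationName .
        ?li      tpch:extendedPrice   ?extPrice .
        ?li      tpch:discount        ?discount .
      }
    GROUP BY ?nationName
    ORDER BY DESC (?revenue)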

Doing this with performance at the level of direct SQL in the case of mapping, and not very much slower with physical triples, is an important milestone on the way to a real-world enterprise data web. Real life has harder and more unexpected issues than a benchmark, but at any rate doing the benchmark without breaking a sweat is a step on the way. We sent a paper to ESWC 2008 about this, but it was rather incomplete; by the time of the VLDB submission deadline in March we'll have more meat.

Another tack, soon to start, is a re-architecting of Zitgist around clustered Virtuoso. Aside from matters of scale, this will make a number of qualitatively new things possible. Again, more will be released in the first quarter of 2008.

Beyond these short- and mid-term goals, we have the introduction of entirely dynamic, demand-driven partitioning, à la Google Bigtable or Amazon Dynamo. Now, regular partitioning will do for a while yet, but this is the future when we move to the vision of linked data everywhere.

In conclusion, this year we have built the basis, and next year is about deployment. The bulk of the really new development is behind us, and now we start applying it. Also, the community will find adoption easier thanks to our recent support of the common RDF APIs.