Perseus, Andromeda, and RDF

It has been several months since my last blog post. In this day and age of the attention economy, what gives me the insolence so to neglect my duty to mindshare?

Well, Perseus wasn't blogging or checking his email either, when he went to fetch the Gorgon's head. As Joseph Campbell puts it, the hero breaks into a world separate from the ordinary in order to bring back a blessing which will revitalize the community.

Thus, I deliberately withdrew from the public conversation, in faith that it would take care of itself and that I would still not be altogether forgotten. As it happens, I was confirmed in this when recently invited to submit a talk for the Semdata workshop at VLDB 2010.

Great deeds are not only personal accomplishments but also play a role in a broader context. The quest may appear remote and difficult to execute but its outcome can be quite tangible: Andromeda needed no elaborate sales pitch to convince her of the advantages of not being eaten by the sea serpent.

Thus right after the meeting in Sofia last March, I followed the vertical treasure map into the realm of first principles. As Perseus received advice from Athena, so was I informed by the Platonic ideas of locality and concurrency.

The great quests have an outer and inner aspect. Likewise here, bringing the ideas to physical reality gave me a great deal of material on cognitive function itself. For human and computer alike, it appears that the main reason why anything at all works is cache. Locality and parallelism again. Maybe I will say something more about memory, attention, interface, and paradigm some other time. On the other hand, such material is bound to be unpopular even if valid.

By now, you may ask yourself what I am talking about.

We remember that Andromeda's fix was due to her mother, Cassiopeia, having claimed greater beauty than the daughters of the sea-god Poseidon. To transpose the archetype into the present, it is like Tim B-L saying that OWLs (by the way sacred to Athena) are more semantic than Codd's brainchild. Yet the relational community sees RDF as something not quite serious. A matter of scale(s) — just think of the sea serpent.

So, I am talking about what I alluded to in the 2010 New Year's statement on this blog: RDF as a viable alternative to relational for big data. This means that RDF is no longer a specialty niche in which everything costs several times more in both time and space, a cost tolerated only because bringing everything into a relational model is a hopeless task and there is no real alternative.

The value proposition is that for any current RDF user, the present assets will go four times farther than before with the next release of Virtuoso. For a prospective RDF user, the cost of keeping an ETLed RDF integration warehouse is now in the same ballpark as the relational cost, except that schema is now flexible, and the time to integrate and answer is accordingly shorter. For users of analytics-oriented RDBMS, the next Virtuoso is a full cluster-capable SQL column store. Its merits compared to others in this space will be published later with benchmarks like TPC-H. As an extra bonus for such users, Virtuoso brings SQL federation and a growth path to RDF, should this become interesting.

This is accomplished by introducing a new column-wise compressed-storage engine with corresponding changes to query execution. The general principles are explained in Daniel Abadi's famous Ph.D. thesis. The compression is tuned by the data itself, without user intervention. Further, our implementation remains capable of run-time typing, so the column-store advantages for RDF are obtained without going to a task-specific schema. But since data types, even if determined at run-time, are still in practice repetitive, the advantages of running on homogeneous vectors are not lost.
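To make the idea concrete, here is a toy sketch in Python of a column whose encoding is chosen from the data alone, and of splitting a run-time-typed column into homogeneous vectors. This is not Virtuoso's actual storage format; the function names, the cost rule, and the sample values are all assumptions made up for illustration.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def choose_encoding(column):
    """Pick an encoding by looking only at the data (hypothetical cost rule).

    A run costs two slots (value and count); a plain value costs one.
    """
    runs = run_length_encode(column)
    if 2 * len(runs) < len(column):
        return ("rle", runs)
    return ("plain", list(column))


def split_by_type(column):
    """Group consecutive values of the same run-time type into vectors."""
    return [(t.__name__, list(vals)) for t, vals in groupby(column, key=type)]


# Hypothetical quads sorted by predicate: the predicate column is repetitive,
# while the object column mixes types that are only known at run-time.
predicates = ["rdf:type"] * 4 + ["ex:price"] * 3
objects = ["ex:Order", "ex:Order", "ex:Order", "ex:Item", 10, 12, 12.5]

print(choose_encoding(predicates))
print(choose_encoding(objects)[0])
print(split_by_type(objects))
```

In this toy case the repetitive predicate column is stored as runs, the object column stays plain, and the object values still fall into a few same-typed stretches that a vectored executor could process without a fixed schema.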

When storing an RDF extraction of TPC-H data, we get a storage usage of 6.3 bytes per quad. If you do not care about queries where the predicate is unspecified, the storage requirement drops to 4.7 bytes per quad. Whether storing the data as RDF quads or as Vertica-style multicolumn projections, the working set is about the same. Since having enough of the data in memory is the sine qua non of flexible querying, the point is made. QED.
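For a rough feel of what these figures mean for memory, the arithmetic can be sketched as below. The bytes-per-quad numbers are the ones quoted above; the quad count of one billion is purely an assumed example, not a measured data set, and the function name is made up for illustration.

```python
BYTES_PER_QUAD_FULL = 6.3          # all indices, figure quoted above
BYTES_PER_QUAD_NO_UNSPEC_P = 4.7   # predicate always given, figure quoted above


def working_set_gb(num_quads, bytes_per_quad):
    """Approximate storage footprint in decimal gigabytes (10**9 bytes)."""
    return num_quads * bytes_per_quad / 1e9


# Assumed example size: one billion quads.
num_quads = 1_000_000_000
print(round(working_set_gb(num_quads, BYTES_PER_QUAD_FULL), 2))
print(round(working_set_gb(num_quads, BYTES_PER_QUAD_NO_UNSPEC_P), 2))
```

At that assumed size, either layout needs only single-digit gigabytes, which is the sense in which the working set can be kept in memory.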

In Virtuoso also, relational remains a bit faster but a penalty of 1.3x or so for RDF is quite tolerable, considering that a priori schema is no longer needed.

This means that we are coming into an age where the warehouse becomes an ad hoc asset, to be filled with RDF, without the need to develop an a priori universal schema for all data one may ever wish to integrate, now or in the future. The data can be stored as RDF and projected from there into any form that may be needed at any time, whether the target format is more RDF or a task-specific relational schema.

Availability is planned for late 2010, first as a Virtuoso Open Source preview.

Orri Erling's Weblog
