Linked Data and Virtuoso in 2010

It is again time for the end-of-year blog post.

In 2009, RDF scalability questions were solved in their broad outline, and the corresponding Virtuoso release was built and used in production internally. So far it has been available on a case-by-case basis; general availability is now imminent.

In 2010, we take on a new challenge: to bring RDF closer to parity with equivalent relational solutions. This will also entail some significant improvements to our relational technology.

Storage density is a key ingredient of performance. Some of the advances will be in this area; others will come from increased parallelism of execution. Right now we run operations in vectored batches in cluster configurations, where message latency forces operations to be shipped in large chunks. Next we will do this across the board, including on single servers. The advantages of this for cache behavior and other factors are well known in the literature.
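
To make the idea of vectored batches concrete, here is a minimal Python sketch. It is an illustration only and not how Virtuoso implements it: the names `batches`, `lookup_batch`, and `vectored_lookup` are hypothetical, and a plain dictionary stands in for a real index. The point is that one operator call handles a whole vector of keys and can probe the index in key order.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional


def batches(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Cut a stream of keys into lists of at most `size` elements."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def lookup_batch(index: Dict[int, str], keys: List[int]) -> List[Optional[str]]:
    """Look up a whole vector of keys in a single operator invocation.

    The keys are probed in sorted order. With a dict this changes nothing,
    but with a tree-structured index (an assumption of this sketch) sorted
    probes tend to touch neighboring pages, which helps cache behavior.
    Results are written back in the caller's original order.
    """
    order = sorted(range(len(keys)), key=keys.__getitem__)
    out: List[Optional[str]] = [None] * len(keys)
    for pos in order:
        out[pos] = index.get(keys[pos])
    return out


def vectored_lookup(
    index: Dict[int, str], keys: Iterable[int], batch_size: int = 1000
) -> List[Optional[str]]:
    """Run the lookup one batch at a time instead of one key at a time."""
    results: List[Optional[str]] = []
    for chunk in batches(keys, batch_size):
        results.extend(lookup_batch(index, chunk))
    return results
```

The batch size of 1000 is an arbitrary default for the sketch. In a real engine it would be tuned against cache sizes and, in a cluster, against message latency.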

Looking at environmental factors, a new SPARQL is at the Working Draft stage. With it, SPARQL has basic parity with SQL in expressivity, which is a prerequisite for RDF to become a data model that can be an alternative to relational outside of very specialized contexts.

As the standards process brings SPARQL closer to being an alternative to SQL for data integration, we will develop the database engine technology so that RDF's inherent penalty in storage overhead and processing time decreases substantially. This will make RDF a workable integration medium in places where it previously was not. Of course, an application-specific schema will retain some advantage over a generic one, but then one can have a purely relational application on Virtuoso as well. Just think of the possibility of an application-specific schema emerging by itself in a workload-driven fashion.

As background data for an increasing number of fields becomes available as linked data, using it together with proprietary data for analytics and discovery becomes increasingly interesting. This is the initial line of RDF data warehousing. The biomedical field has many examples. The technologies we will release during 2010 will be geared towards enabling a second line of RDF applications, where ad hoc, agile integration with RDF as a lingua franca becomes a real alternative to relational solutions with ETL point solutions for harvesting information from diverse systems. One can see how RDF's flexibility and expressivity may add to agility in any number of situations where data from heterogeneous sources needs to be integrated. Which of today's business scenarios does not face this issue?

References:

Linked Data & The Year 2009
Retrospective and Outlook for 2008
Other Scalability and Benchmarking posts
