It is again time for the end-of-year blog
post.
In 2009, RDF scalability questions were solved in their
broad outline and the corresponding Virtuoso release was built and used in
production internally. So far it has been available on a
case-by-case basis; general availability is now imminent.
In 2010, we take on a new challenge: to bring RDF closer to
parity with equivalent relational solutions. This will also entail
some significant improvements to our relational technology.
Storage density is a key ingredient of performance. Some of the
advances will be in this area; other advances will be in increased
parallelism of execution. Right now we run things in vectored
batches in cluster situations where message latency forces
operations to be shipped in large chunks. Next we will do this
across the board, also in single servers. The advantages of this
for cache behavior and other factors are known
in the literature.
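To make the contrast concrete, here is a minimal Python sketch of tuple-at-a-time versus vectored (batched) execution of a filter followed by a lookup. It is an illustration only, not Virtuoso's implementation: the function names, the integer rows, and the batch size are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List


def tuple_at_a_time(rows: Iterable[int],
                    predicate: Callable[[int], bool],
                    lookup: Callable[[int], int]) -> List[int]:
    """Each row passes through every operator before the next row starts."""
    out = []
    for row in rows:
        if predicate(row):
            out.append(lookup(row))
    return out


def batches(rows: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group the input into lists of at most `size` rows."""
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def vectored(rows: Iterable[int],
             predicate: Callable[[int], bool],
             lookup: Callable[[int], int],
             batch_size: int = 1000) -> List[int]:
    """Each operator runs over a whole batch before the next operator starts."""
    out: List[int] = []
    for batch in batches(rows, batch_size):
        selected = [r for r in batch if predicate(r)]  # filter operator
        out.extend([lookup(r) for r in selected])      # lookup operator
    return out


def is_even(r: int) -> bool:
    return r % 2 == 0


def square(r: int) -> int:
    return r * r


# Both strategies produce the same result; only the order of work differs.
same = vectored(range(10), is_even, square, batch_size=3) == \
    tuple_at_a_time(range(10), is_even, square)
```

In a real engine the gain comes from each operator running in a tight loop over many values, which is friendlier to caches and amortizes per-call overhead. In a cluster, the same batch is also the natural unit to ship in one message.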
Looking at environmental factors, we have a new SPARQL at a Working Draft stage. With it, we have
basic parity with SQL expressivity, which is a prerequisite
for RDF to become a data model that can be an alternative to
relational outside of very specialized contexts.
As the standards process makes SPARQL closer to being an
alternative to SQL for data integration, we will make the database
engine technology such that RDF's inherent penalty in terms of
storage overhead and processing time substantially decreases. This
will make RDF a workable integration medium in places where it
was not workable before. Of course, an application-specific schema will retain some advantage over a
generic one, but then one can have a purely relational application
on Virtuoso as well. Just think of the possibility of an
application-specific schema emerging by itself in a workload-driven
fashion.
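As a rough illustration of the storage overhead in question, the sketch below stores the same record once as an application-specific row and once as generic triples. The record, its field names, and the naive uncompressed layout are assumptions made for the example; a real store compresses and indexes this very differently.

```python
# Hypothetical record; the field names are invented for illustration.
person = {"id": 42, "name": "Alice", "email": "alice@example.org", "city": "Oslo"}


def as_row(record):
    """Application-specific schema: one row, column names live in the schema."""
    return tuple(record.values())


def as_triples(record, key="id"):
    """Generic schema: one (subject, predicate, object) triple per attribute."""
    subject = record[key]
    return [(subject, p, o) for p, o in record.items() if p != key]


row = as_row(person)
triples = as_triples(person)

row_values = len(row)                            # values stored for the row
triple_values = sum(len(t) for t in triples)     # values stored for the triples
```

In this naive layout the row holds 4 values while the triples hold 9, because the subject and predicate are repeated for every attribute. Denser storage narrows this gap, and a schema derived from the workload could narrow it further.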
As background data for an increasing number of fields becomes
available as linked data, using this together with
proprietary data for analytics and discovery becomes increasingly
interesting. This is the initial line of RDF data warehousing. The
biomedical field has many examples. The technologies we will
release during 2010 will be geared towards enabling a second line
of RDF applications, where ad hoc agile integration with RDF as a
lingua franca becomes a real alternative to relational solutions
with ETL point solutions for harvesting information from diverse systems. One may
see how RDF's flexibility and expressivity may add to agility in
any number of situations where data from heterogeneous sources needs
to be integrated. Which of today's business scenarios does not face
this issue?