In Hoc Signo Vinces (part 5 of n) -- The Return of SQL Federation

In past years, Virtuoso has mostly been known as an RDF store. Some of you will recall that Virtuoso has always had SQL and SQL federation capabilities.

Now that the Virtuoso column store has come of age and made Virtuoso a strong contender for SQL warehousing, the SQL federation aspect is revitalized as well.

In the previous article, we saw that Virtuoso can load files at well over gigabit-ethernet wire speed. The same of course applies to SQL federation: we can copy the 100 GB TPC-H dataset between two Virtuoso instances in only slightly more time than it takes to load the data from files. When extracting data from other SQL stores into Virtuoso over a network, the network is likely to be the slowest link. So, to be "semantically elastic," federating has become warehousing.

The articles to follow will show excellent query speed for analytics. Combined with connectivity to any existing SQL infrastructure, this makes Virtuoso an easy-to-deploy accelerator cache for almost any data integration situation. It also simplifies query execution: the more data one has locally, the more query optimization choices there are, and performance becomes much more predictable than when queries execute across many heterogeneous systems. The win is compounded by the reduced load on the line-of-business databases.

The missing link then becomes heterogeneous log shipping. One usually cannot modify a line-of-business system; for example, adding triggers for tracking changes is generally not done. Being able to read the transaction logs of all the most common DBMSs would offer a solution.

The barrier to having one's own extract of data for analysis has become much lower. Even the ETL step can be streamlined by SQL federation. For very time-sensitive applications, one can always keep a local copy of the history and query it in a union with the most recent data, which is read from the line-of-business system. At the end of the TPC-H series, we will show examples of a near real-time analytics system that keeps up to date with an Oracle database.
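The "local history in a union with the most recent data" pattern can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, with one in-memory database standing in for the local warehouse and a second, attached one standing in for the line-of-business system. It does not use Virtuoso's federation syntax, and the table names, columns, and cutoff date are all hypothetical.

```python
import sqlite3

# SQLite stands in for both systems; this is NOT Virtuoso's federation API.
conn = sqlite3.connect(":memory:")                  # plays the local warehouse
conn.execute("ATTACH DATABASE ':memory:' AS lob")   # plays the line-of-business DB

# Hypothetical schema: extracted history locally, live rows on the "remote" side.
conn.execute(
    "CREATE TABLE main.orders_history (o_key INTEGER, o_date TEXT, o_total REAL)"
)
conn.execute(
    "CREATE TABLE lob.orders (o_key INTEGER, o_date TEXT, o_total REAL)"
)

conn.executemany(
    "INSERT INTO main.orders_history VALUES (?, ?, ?)",
    [(1, "2013-01-10", 100.0), (2, "2013-02-15", 250.0)],
)
conn.executemany(
    "INSERT INTO lob.orders VALUES (?, ?, ?)",
    # Order 2 has already been extracted; order 3 is newer than the extract.
    [(2, "2013-02-15", 250.0), (3, "2013-03-01", 75.0)],
)


def combined_orders(conn, cutoff):
    """Return history up to the cutoff plus newer rows from the source system.

    The cutoff predicate keeps the two branches disjoint, so UNION ALL
    does not return a row twice even if it exists on both sides.
    """
    query = """
        SELECT o_key, o_date, o_total
          FROM main.orders_history
         WHERE o_date <= :cutoff
        UNION ALL
        SELECT o_key, o_date, o_total
          FROM lob.orders
         WHERE o_date > :cutoff
        ORDER BY o_key
    """
    return conn.execute(query, {"cutoff": cutoff}).fetchall()


rows = combined_orders(conn, "2013-02-28")
for row in rows:
    print(row)
```

The design point is that only the branch after the cutoff touches the line-of-business system, so the bulk of the analytical scan runs against local data while the result still includes the newest rows.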

For RDF users, this means we have the capacity to extract RDF at bulk load speed from any relational source, whether local or remote. For the test system discussed in the TPC-H series, RDF load shows a sustained throughput of around 320K triples per second. This means that an RDF materialization of the 100 GB TPC-H dataset, about 12.5 billion triples, is done in under 11 hours. This is a vast improvement over the present, and we will show the details in a forthcoming article.
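The "under 11 hours" figure follows directly from the two numbers quoted above. The short check below simply divides the triple count by the sustained rate; the function name is made up for this illustration, and the calculation assumes the rate stays constant over the whole load.

```python
def load_hours(triples, triples_per_second):
    """Estimated load time in hours at a constant sustained rate."""
    seconds = triples / triples_per_second
    return seconds / 3600.0


# Figures quoted in the text: about 12.5 billion triples at around 320K triples/s.
hours = load_hours(12.5e9, 320_000)
print(f"{hours:.2f} hours")  # prints "10.85 hours"
```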