I had a conversation with Andy Seaborne of Epimorphics, one of the original creators of the Jena RDF framework tool chain and editor of many W3C Recommendations, including both versions of SPARQL. We exchanged some news; I told Andy about our progress in cutting the RDF-to-SQL performance penalty and doing more and better SQL tricks. Andy asked me whether there were use cases for analytics over RDF, not in the business intelligence sense, but in the sense of machine learning or discovery of structure. There is indeed such work, notably in data set summarization and description. Part of this has to do with learning the schema, as one would if one wanted to put triples into tables where appropriate. CWI has worked in this direction within LOD2, as has DERI (Giovanni Tummarello's team), in the context of giving hints to SPARQL query writers. I would also mention Chris Bizer and colleagues at the University of Mannheim, with their data integration work, which is all about similarity detection in a schema-less world, e.g., the 150M HTML tables in the Common Crawl, briefly mentioned in the previous post. Jens Lehmann of the University of Leipzig has also done work on learning a schema from the data, in this case in OWL.
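As a rough illustration of the schema-learning idea, one simple technique groups subjects by the exact set of predicates they carry (so-called characteristic sets); each frequent predicate set is then a candidate table. This is only a sketch with invented toy data, not any of the systems mentioned above:

```python
from collections import defaultdict

# Toy triples: (subject, predicate, object) -- illustrative data only
triples = [
    ("p1", "name", "Alice"), ("p1", "email", "a@x.org"),
    ("p2", "name", "Bob"),   ("p2", "email", "b@x.org"),
    ("b1", "title", "RDF"),  ("b1", "author", "p1"),
]

def characteristic_sets(triples):
    """Group subjects by the exact set of predicates they use.

    Each frequent predicate set is a candidate relational table:
    one column per predicate, one row per subject."""
    preds_of = defaultdict(set)
    for s, p, _ in triples:
        preds_of[s].add(p)
    groups = defaultdict(list)
    for s, preds in preds_of.items():
        groups[frozenset(preds)].append(s)
    return groups

for preds, subjects in characteristic_sets(triples).items():
    print(sorted(preds), "->", sorted(subjects))
```

Here the subjects `p1` and `p2` fall into one group (a person-like table with `name` and `email` columns) and `b1` into another; real systems additionally merge near-identical sets and threshold by frequency.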

Andy was later on a panel where Phil Archer asked him whether SPARQL was slow by nature or whether this was a matter of bad implementations. Andy answered approximately as follows: "If you allow for arbitrary ad hoc structure, you will always pay something for this. However, if you tell the engine what your data is like, it is no different from executing SQL." This is essentially the gist of our conversation. Most likely we will make this happen via an adaptive schema: tables for the regular part of the data, with the exceptions kept as quads.
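One hedged reading of "the regular part as tables, exceptions as quads" is the following sketch: subjects covered by a known predicate signature become rows, and any leftover triples stay in the quad store. The signature and all names are invented for illustration:

```python
# Assumed "regular" signature for person-like subjects (illustrative)
PERSON_COLS = ("name", "email")

triples = [
    ("p1", "name", "Alice"), ("p1", "email", "a@x.org"),
    ("p2", "name", "Bob"),   ("p2", "email", "b@x.org"),
    ("p2", "nickname", "Bobby"),  # irregular: stays as an exception
]

def split_regular(triples, cols):
    """Return (rows, exceptions): one tuple-shaped row per subject for
    the columns in `cols`, plus the triples the signature cannot hold."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o
    rows, exceptions = {}, []
    for s, props in by_subject.items():
        rows[s] = tuple(props.get(c) for c in cols)
        for p, o in props.items():
            if p not in cols:
                exceptions.append((s, p, o))
    return rows, exceptions

rows, exceptions = split_regular(triples, PERSON_COLS)
print(rows)        # subject -> (name, email)
print(exceptions)  # [('p2', 'nickname', 'Bobby')]
```

Queries over the regular predicates can then run as plain table scans and joins, falling back to the exception quads only when needed.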

Later I talked with Phil about the "SPARQL is slow" meme. The fact is that Virtuoso SPARQL will outperform or match PostGIS SQL for geospatial lookups against the OpenStreetMap dataset. Virtuoso SQL will win by a factor of 5 to 10. Still, the "SPARQL is slow" meme is not entirely without a basis in fact. I would say that the really blatant cases that give SPARQL a bad name are query optimization problems. With 50 triple patterns in a query, there are 50-factorial ways of getting a bad plan. This is where the catastrophic failures of 100+ times worse than SQL come from. The regular penalty of doing triples vs. tables is somewhere between 2.5 (Star Schema Benchmark) and 10 (lookups with many literals), quite acceptable for many applications. Some really bad cases can occur with regular expressions on URI strings or literals, but then, if this is the core of the application, it should use a different data model or an n-gram index.
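The size of that plan space is easy to check: n join operands admit n! left-deep join orders (and still more with bushy plans), so exhaustive enumeration is hopeless long before 50 triple patterns:

```python
import math

# Left-deep join orders for n triple patterns grow as n!.
for n in (3, 10, 50):
    print(n, math.factorial(n))

# At n = 50 the space exceeds 3 x 10^64 orders, so an optimizer must
# rely on statistics and heuristics rather than enumerating plans.
```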

The solutions, including more dependable query plan choice, will flow from an adaptive schema, which essentially reduces RDF back to relational, though without forcing schema-first design and with accommodation for exceptions in the data.

Phil noted here that there already exist many (so far, proprietary) ways of describing the shape of a graph. He said there would be a W3C activity for converging these. If so, a vocabulary that can express relationships, the types of related entities, their cardinalities, etc., comes close to a SQL schema and its statistics. Such a thing can be the output of data analysis, or the input to a query optimizer or storage engine, for using a schema where one in fact exists. With such a description available, there is no reason why things would be less predictable than with SQL. The idea of a re-convergence of data models is definitely in the air; this is in no sense limited to us.
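One hedged sketch of the kind of statistics such a shape description could carry, of the sort a query optimizer would consume: per-predicate triple counts, distinct subjects, and whether a predicate is single-valued. The representation is invented for illustration:

```python
from collections import defaultdict

# Toy triples -- illustrative data only
triples = [
    ("p1", "name", "Alice"), ("p2", "name", "Bob"),
    ("p1", "knows", "p2"),   ("p1", "knows", "p3"),
]

def predicate_stats(triples):
    """For each predicate: triple count, distinct subject count, and
    whether it is single-valued (at most one object per subject)."""
    objs = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        objs[p][s].add(o)
    stats = {}
    for p, per_subject in objs.items():
        stats[p] = {
            "triples": sum(len(v) for v in per_subject.values()),
            "subjects": len(per_subject),
            "single_valued": all(len(v) == 1 for v in per_subject.values()),
        }
    return stats

print(predicate_stats(triples))
```

A single-valued predicate maps naturally to a nullable column, while a multi-valued one (`knows` above) needs a separate table or quads; the counts are exactly the cardinality inputs a cost-based optimizer wants.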

Linked Geospatial Data 2014 Workshop posts: