SEMANTiCS 2014 (part 3 of 3): Conversations

I was asked for an oracular statement about the future of relational database (RDBMS) at the conference. The answer, without doubt or hesitation, is that this is forever. But this does not mean that the RDBMS world would be immutable, quite the opposite.

The specializations converge. The RDBMS becomes more adaptable and less schema-first. Of course the RDBMS also take new data models beside the relational. RDF and other property graph models, for instance.

The schema-last-ness is now well in evidence. For example, PostgreSQL has an hstore column type which is a list of key-value pairs. Vertica has a feature called flex tables where a column can be added on a row-by-row basis.

Specialized indexing for text and geometries is a well established practice. However, dedicated IR systems, often Lucene derivatives, can offer more transparency in the IR domain for things like vector-space-models and hit-scoring. There is specialized faceted search support which is quite good. I do not know of an RDBMS that would do the exact same trick as Lucene for facets, but, of course, in the forever expanding scope of RDB, this is added easily enough.

JSON is all the rage in the web developer world. Phil Archer even said in his keynote, as a parody of the web developer: " I will never touch that crap of RDF or the semantic web; this is a pipe dream of reality ignoring academics and I will not have it. I will only use JSON-LD."

XML and JSON are much the same thing. While most databases have had XML support for over a decade, there is a crop of specialized JSON systems like MongoDB. PostgreSQL also has a JSON datatype. Unsurprisingly, MarkLogic too has JSON, as this is pretty much the same thing as their core competence of XML.

Virtuoso, too, naturally has a JSON parser, and mapping this to the native XML data type is a non-issue. This should probably be done.

Stefano Bertolo of the EC, also LOD2 project officer, used the word Cambrian explosion when talking about the proliferation of new database approaches in recent years.

Hadoop is a big factor in some environments. Actian Vector (née VectorWise), for example, can use this as its file system. HDFS is singularly cumbersome for this but still not impossible and riding the Hadoop bandwagon makes this adaptation likely worthwhile.

Graphs are popular in database research. We have a good deal of exposure to this via LDBC. Going back to an API for database access, as is often done in graph database, can have its point, especially as a reaction to the opaque and sometimes hard to predict query optimization of declarative languages. This just keeps getting more complex, so a counter-reaction is understandable. APIs are good if crossed infrequently and bad otherwise. So, graph database APIs will develop vectoring, is my prediction and even recommendation in LDBC deliverables.

So, there are diverse responses to the same evolutionary pressures. These are of initial necessity one-off special-purpose systems, since the time to solution is manageable. Doing these things inside an RDBMS usually takes longer. The geek also likes to start from scratch. Well, not always, as there have been some cases of grafting some entirely non-MySQL-like functionality, e.g. Infobright and Kickfire, onto MySQL.

From the Virtuoso angle, adding new data and control structures has been done many times. There is no reason why this cannot continue. The next instances will consist of some graph processing (BSP, or Bulk Synchronous Processing) in the query languages. Another recent example is an interface for pluggable specialized content indices. One can make chemical structure indices, use alternate full text indices, etc., with this.

Most of this diversification has to do with physical design. The common logical side is a demand for more flexibility in schema and sometimes in scaling, e.g., various forms of elasticity in growing scale-out clusters, especially with the big web players.

The diversification is a fact, but the results tend to migrate into the RDBMS given enough time.

On the other hand, when a new species like the RDF store emerges, with products that do this and no other thing and are numerous enough to form a market, the RDBMS functionality seeps in. Bigdata has a sort of multicolumn table feature, if I am not mistaken. We just heard about the wish for strict schema, views, and triggers. By all means.

From the Virtuoso angle, with structure awareness, the difference of SQL and RDF gradually fades, and any advance can be exploited to equal effect on either side.

Right now, I would say we have convergence when all the experimental streams feel many of the same necessities.

Of course you cannot have a semantic tech conference without the matter of the public SPARQL end point coming up. The answer is very simple: If you have operational need for SPARQL accessible data, you must have your own infrastructure. No public end points. Public end points are for lookups and discovery; sort of a dataset demo. If operational data is in all other instances the responsibility of the one running the operation, why should it be otherwise here? Outsourcing is of course possible, either for platform (cloud) or software (SaaS). To outsource something with a service level, the service level must be specifiable. A service level cannot be specified in terms of throughput with arbitrary queries but in terms of well defined transactions; hence the services world runs via APIs, as in the case of Open PHACTS. For arbitrary queries (i.e., analytics on demand), with the huge variation in performance dependent on query plans and configuration of schema, the best is to try these things with platform on demand in a cloud. Like this, there can be a clear understanding of performance, which cannot be had with an entirely uncontrolled concurrent utilization. For systems in constant operation, having one's own equipment is cheaper, but still might be impossible to procure due to governance.

Having clarified this, the incentives for operators also become clearer. A public end point is a free evaluation; a SaaS deal or product sale is the commercial offering.

Anyway, common datasets like DBpedia are available preconfigured on AWS with a Virtuoso server. For larger data, there is a point to making ready-to-run cluster configurations available for evaluation, now that AWS has suitable equipment (e.g., dual E5 2670 with 240 GB RAM and SSD for USD 2.8 an hour). According to Amazon, up to five of these are available at a time without special request. We will try this during the fall and make the images available.