SNB Interactive, Part 2: Modeling Choices

Details

SNB Interactive is the wild frontier, with very few rules. This is necessary, among other reasons, because there is no standard property graph data model, and because the contestants support a broad mix of programming models, ranging from in-process APIs to declarative query.

In the case of Virtuoso, we have played with SQL and SPARQL implementations. For a fixed schema and well known workload, SQL will always win. The reason is that SQL allows materialization of multi-part indices and data orderings that make sense for the application. In other words, there is transparency into physical design. An RDF/SPARQL-based application may also have physical design by means of structure-aware storage, but this is more complex and here we are just concerned with speed and having things work precisely as we intend.

Schema Design

SNB has a regular schema described by a UML diagram. This has a number of relationships, of which some have attributes. There are no heterogenous sets, i.e., no need for run-time typed attributes or graph edges with the same label but heterogenous end-points. Translation into SQL or SPARQL is straightforward. Edges with attributes (e.g., the foaf:knows relation between people) would end up represented as a subject with the end points and the effective date as properties. The relational implementation has a two-part primary key and the effective date as a dependent column. A native property graph database would use an edge with an extra property for this, as such are typically supported.

The only table-level choice has to do with whether posts and comments are kept in the same or different data structures. The Virtuoso schema uses a single table for both, with nullable columns for the properties that occur only in one. This makes the queries more concise. There are cases where only non-reply posts of a given author are accessed. This is supported by having two author foreign key columns each with its own index. There is a single nullable foreign key from the reply to the post/comment being replied to.

The workload has some frequent access paths that need to be supported by index. Some queries reward placing extra columns in indices. For example, a common pattern is accessing the most recent posts of an author or a group of authors. There, having a composite key of ps_creatorid, ps_creationdate, ps_postid pays off since the top-k on creationdate can be pushed down into the index without needing a reference to the table.

The implementation is free to choose data types for attributes, particularly datetimes. The Virtuoso implementation adopts the practice of the Sparksee and Neo4j implementations and represents this is a count of milliseconds since epoch. This is less confusing, faster to compare, and more compact than a native datetime datatype that may or may not have timezones, etc. Using a built-in datetime seems to be nearly always a bad idea. A dimension table or a number for a time dimension avoids the ambiguities of a calendar or at least makes these explicit.

The benchmark allows procedurally maintained materializations of intermediate results for use by queries as long as these are maintained transaction-by-transaction. For example, each person could have the 20 newest posts by their immediate contacts precomputed. This would reduce Q2 "top of the wall" to a single lookup. This does not however appear to be worthwhile. The Virtuoso implementation does do one such materialization for Q14: A connection weight is calculated for every pair of persons that know each other. This is related to the count of replies by either to content generated by the other. If there does not exist a single reply in either direction, the weight is taken to be 0. This weight is precomputed after bulk load and subsequently maintained each time a reply is added. The table for this is the only row-wise structure in the schema and represents a half-matrix of connected people, i.e., person1, person2 -> weight. Person1 is by convention the one with the smaller p_personid. Note that comparing IDs in this way is useful but not normally supported by SPARQL/RDF systems. SPARQL would end up comparing strings of URIs with disastrous performance implications unless an implementation-specific trick were used.

In the next installment, we will analyze an actual run.