The first keynote of Semantics 2014 was by Phil Archer of the W3C, entitled "10 Years of Achievement." After my talk, during the questions, Phil brought up the upcoming W3C working group charter on RDF Data Shapes. We had already discussed this at the reception the night before, and I will give some ideas about it here.

After the talk, my answer was that the existence of something that expresses the same sort of thing as SQL DDL, with W3C backing, can only be a good thing, and that it will give the structure awareness work by OpenLink in Virtuoso, and probably by others, a more official seal of approval. Quite importantly, this will facilitate interoperability and will raise structure awareness from a product-specific optimization trick to a respectable, generally approved piece of functionality.

This is the general gist of the matter and can hardly be otherwise. But underneath is a whole world of details, which we discussed at the reception.

Phil noted that there was controversy around whether a lightweight OWL-style representation or SPIN should function as the basis for data shapes.

Phil stated in the keynote that the W3C considered the RDF series of standards as good and complete, but would still have working groups for filling in gaps as these came up. This is what I had understood from my previous talks with him at the Linking Geospatial Data workshop in London earlier this year.

So, against this backdrop, as well as what I had discussed with Ralph Hodgson of Top Quadrant at a previous LDBC TUC meeting in Amsterdam, SPIN seems to me a good fit.

Now, it turns out that we are talking about two different use cases. Phil said that the RDF Data Shapes use case was about making explicit what applications required of data. For example, all products should have a unit price, and this should have one value that is a number.

The SPIN proposition on the other hand, as Ralph himself put it in the LDBC meeting, is providing to the linked data space functionality that roughly corresponds to SQL views. Well, this is one major point, but SPIN involves more than this.

So, is it DDL or views? These are quite different. I proposed to Phil that there was in fact little point in fighting over this; best to just have two profiles.

To be quite exact, even SQL DDL equivalence is tricky, since enforcing this requires a DBMS; consider, for instance, foreign key and check constraints. At the reception, Phil stressed that SPIN was certainly good but since it could not be conceived without a SPARQL implementation, it was too heavy to use as a filter for an application that, for example, just processed a stream of triples.

The point, as I see it, is that there is a wish to have data shape enforcement, at least to some level, in a form that can apply to a stream without random access capability or a general-purpose query language. This can make sense for some big-data-style applications, like an ETL stage that pre-cooks data before the application sees it. Applications mostly run against a DBMS, but in some cases the consumer could also be a specialized map-reduce or graph analytics job, where there is no low-cost random access.

My own take is that views are quite necessary, especially for complex queries; this is why Virtuoso has the SPARQL macro extension. By query expansion, this does a large part of what general purpose inference would do, except for complex recursive cases. Simple recursive cases come down to transitivity and still fit the profile. SPIN is a more generic thing, but has a large intersection with SPARQL macro functionality.

My other take is that structure awareness needs a way of talking about structure. This is a use case that is clearly distinct from views.

A favorite example of mine is the business rule that a good customer is one that has ordered more than 5 times in the last year, for a total of more than so much, and has no returns or complaints. This can be stated as a macro or SPIN rule with some aggregates and existences. This cannot be stated in any of the OWL profiles. When presented with this, Phil said that this was not the use case. Fair enough. I would not want to describe what amounts to SQL DDL in these terms either.
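To make the shape of this rule concrete, here is a minimal sketch in plain Python rather than in SPIN or SPARQL macro syntax. The `Order` record, the function name, and the `min_total` parameter (standing in for "so much") are illustrative assumptions, not part of any standard. It shows why the rule needs aggregates (a count and a sum) and a non-existence test.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Order:
    # Hypothetical order record; field names are illustrative only.
    customer: str
    placed: date
    amount: float


def good_customers(orders, returns, complaints, today, min_total):
    """Customers with more than 5 orders in the last year, an order total
    above min_total, and no returns or complaints.

    returns and complaints are iterables of customer ids.
    """
    since = today - timedelta(days=365)
    count, total = {}, {}
    for o in orders:
        if since <= o.placed <= today:
            # Aggregates: number of orders and summed amount per customer.
            count[o.customer] = count.get(o.customer, 0) + 1
            total[o.customer] = total.get(o.customer, 0.0) + o.amount
    # Non-existence: any return or complaint disqualifies the customer.
    flagged = set(returns) | set(complaints)
    return {
        c for c in count
        if count[c] > 5 and total[c] > min_total and c not in flagged
    }
```

None of this is a constraint on what the data may look like; it derives a new classification from the data, which is the view-like use case.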

A related topic that has come up in other conversations is the equivalent of the trigger. One use case of this is enforcement of business rules and complex access rights for updates. So, we see that the whole RDBMS repertoire is getting recreated.

Now, talking from the viewpoint of the structure-aware RDF store, or the triple-stream application for that matter, I will outline some of what data shapes should do. The triggers and views matter is left out, here.

Bulk load, ETL, and stream processing have in common that they should not rely on arbitrary database access, as this would slow them down. Still, they must check the following sorts of things:

  • Data types
  • Presence of some required attributes
  • Cardinality — e.g., a person has no more than one date of birth
  • Ranges — e.g., a product's price is a positive number; gender is male/female; etc.
  • Limited referential integrity — e.g., a product has one product type, and this is a subject of the RDF type product type.
  • Limited intra-subject checks — e.g., delivery date is greater-than-or-equal-to ship date.

All these checks depend on previous triples about the subject; for example, these checks may be conditional on the subject having a certain RDF type. In a data model with a join per attribute, some joining cannot be excluded. Checking conditions that can be resolved one triple at a time is probably not enough, at least not for the structure-aware RDF store case.

But, to avoid arbitrary joins which would require a DBMS, we have to introduce a processing window. The triples in the window must be cross-checkable within the window. With RDF set semantics, some reference data may be replicated among processing windows (e.g., files) with no ill effect.
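The kind of window-local checking described above could be sketched as follows. This is only an illustration under assumptions: the `ex:` property names, the `ITEM_SHAPE` table, and the `check_window` function are hypothetical, literals are assumed to be already parsed into Python values, and the intra-subject rule is hard-coded for brevity. The checker groups the triples of one window by subject and validates type-conditional cardinality, datatype, range, and one intra-subject rule without any access outside the window.

```python
from collections import defaultdict
from datetime import date

RDF_TYPE = "rdf:type"

# Hypothetical shape: property -> (min count, max count, Python type, range check)
ITEM_SHAPE = {
    "ex:price": (1, 1, (int, float), lambda v: v > 0),
    "ex:shipDate": (0, 1, date, None),
    "ex:deliveryDate": (0, 1, date, None),
}


def check_window(triples, shape_type, shape):
    """Validate one processing window of (s, p, o) triples.

    Returns a list of (subject, message) pairs; nothing outside the
    window is consulted.
    """
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        by_subject[s][p].append(o)

    errors = []
    for s, props in by_subject.items():
        # Checks are conditional on the subject having the shape's RDF type.
        if shape_type not in props.get(RDF_TYPE, []):
            continue
        for p, (lo, hi, py_type, in_range) in shape.items():
            values = props.get(p, [])
            if not lo <= len(values) <= hi:
                errors.append((s, f"{p}: expected {lo}..{hi} values, got {len(values)}"))
            for v in values:
                if not isinstance(v, py_type):
                    errors.append((s, f"{p}: unexpected datatype {type(v).__name__}"))
                elif in_range is not None and not in_range(v):
                    errors.append((s, f"{p}: value {v!r} out of range"))
        # Intra-subject check: delivery date must not precede ship date.
        ship = props.get("ex:shipDate", [])
        delivered = props.get("ex:deliveryDate", [])
        if len(ship) == 1 and len(delivered) == 1:
            if isinstance(ship[0], date) and isinstance(delivered[0], date):
                if delivered[0] < ship[0]:
                    errors.append((s, "ex:deliveryDate precedes ex:shipDate"))
    return errors
```

A subject whose triples are split across two windows would be reported as incomplete in both, which is why the window boundaries (for example, files) need to be chosen so that a subject's triples arrive together.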

A version of foreign key declarations is useful. To fit within a processing window, complete enforcement may not be possible but the declaration should still be possible, a little like in SQL where one can turn off checking.

In SQL, it is conventional to name columns by prefixing them with an abbreviation of the table name. All the TPC schemas are like that, for example. Generally in coding, it is good to prefix names with data type or subsystem abbreviation. In RDF, this is not the practice. For reuse of vocabularies, where a property may occur in anything, the namespace or other prefix denotes where the property comes from, not where it occurs.

So, in TPC-H, l_partkey and ps_partkey are both foreign keys that refer to part, and l_partkey is also part of a composite foreign key to partsupp. By RDF practices, both would be called rdfh:hasPart. So, depending on which subject type we have, the ratio of distinct subjects to distinct objects for rdfh:hasPart is 30:1 (lineitem) or 4:1 (partsupp). Because of this usage, the property's features do not depend on the property alone, but on the property plus the subject/object where it occurs.
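As a small worked illustration, the ratio can be computed per (subject type, property) pair rather than per property. The function name and the toy data below are assumptions made for the example; the proportions mirror TPC-H, where scale factor 1 has about 6 million lineitems and 800,000 partsupp rows for 200,000 parts.

```python
from collections import defaultdict


def subject_object_ratio(triples, subject_type):
    """Distinct subjects per distinct object, keyed by (subject type, property).

    subject_type maps each subject to its type; in a real store this would
    come from rdf:type triples.
    """
    subjects = defaultdict(set)
    objects = defaultdict(set)
    for s, p, o in triples:
        key = (subject_type[s], p)
        subjects[key].add(s)
        objects[key].add(o)
    return {key: len(subjects[key]) / len(objects[key]) for key in subjects}


# Toy data in TPC-H proportions: 2 parts, 60 lineitems, 8 partsupps.
parts = ["part0", "part1"]
triples, subject_type = [], {}
for i in range(60):
    subject_type[f"lineitem{i}"] = "lineitem"
    triples.append((f"lineitem{i}", "rdfh:hasPart", parts[i % 2]))
for i in range(8):
    subject_type[f"partsupp{i}"] = "partsupp"
    triples.append((f"partsupp{i}", "rdfh:hasPart", parts[i % 2]))

ratios = subject_object_ratio(triples, subject_type)
print(ratios)
```

A statistic kept per property alone would blend the two populations into a single figure that describes neither.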

In the relational model, when there is a parent and a child item (one to many), the child item usually has a composite key prefixed with the parent's key, with a distinguishing column appended, e.g., l_orderkey, l_linenumber. In RDF, this is rdfh:hasOrder as a property of the lineitem subject. In SQL, there is no single-part lineitem identifier at all, but in RDF, one must be made, since everything must be referenceable with a single value. This does not have to matter very much, as long as it is possible to declare that lineitems will be primarily accessed via their order. It is either this or a scan of all lineitems.

Sometimes a group of lineitems is accessed by the composite foreign key of l_partkey, l_suppkey. There could be a composite index on these. Furthermore, for each l_partkey, l_suppkey in lineitem there exists a partsupp. In an RDF translation, rdfh:hasPart and rdfh:hasSupplier, when they occur in a lineitem subject, together specify exactly one subject of type partsupp. When they occur in a partsupp subject, they are unique as a pair.

Again, because names are not explicit as to where they occur and what role they play, the referential properties do not depend only on the name, but on the name plus the data shape in which it occurs. Declaring and checking all this is conventional in the mainstream and is also useful for query optimization.

Take the other example of a social network where the foaf:knows edge is qualified by the date when the edge was created. This may be done by reification, or more usually by an "entitized" relationship, where the foaf:knows is made into a subject with the persons who know each other and the date of acquaintance as properties. In a SQL schema, this is a key person1, person2 -> date. In RDF, there are two join steps to go from person1 to person2; in SQL, there is one. The extra step is eliminated by declaring that the foaf:knows entity is usually referenced by its person1 Object or person2 Object, not by the Subject identifier of the foaf:knows.

This allows the physical storage to be laid out as O, S, G -> O2, O3, …. A secondary index on S, G, O still allows access by the mandatory subject identifier. In SQL, a structure like this is called a clustered table. In other words, the rows are stored contiguously in the order of a key that is not necessarily the primary key.

So, identifying a clustering key in RDF can be important.

Identifying whether there are value-based accesses on a given Object without making the Object a clustering key is also important. This is equivalent to creating a secondary index in SQL. In the tradition of homogeneous access by anything, such indexing may be on by default, except if the property is explicitly declared to be of low cardinality. For example, an index on gender makes no sense. The same is most often true of rdf:type. Some properties may have many distinct values (e.g., price) but are still not good candidates for indexing, since indexing them contributes to the extreme difference in load time between SQL and all-indexing RDF stores.

Identifying whether a column will be frequently updated is another useful thing. This will turn off indexing and use an easy-to-update physical representation. Plus, properties which are frequently updated are best put physically together. This may, for example, guide the choice between row-wise and column-wise representation. A customer's account balance and orders year-to-date would be an example of such properties.

Some short string valued properties may be frequently returned or used as sorting keys. This requires accessing the literal via an ID in the dictionary table. Non-string literals, numbers, dates, etc., are always inlined (at least in most implementations), but strings are a special question. Bigdata and early versions of Virtuoso would inline short ones; later versions of Virtuoso would not. So specifying, per property/class combination, a length limit for an inlined string is very high gain and trivial to do. The BSBM explore score at large scales can get a factor of 2 gain just from inlining one label. BSBM is out of its league here, but this is still really true and yields benefits across the board. The simpler the application, the greater the win.

If there are foreign keys, then data should be loaded with the referenced entities first. This makes dimensional clustering possible at load time. If the foreign key is frequently used for accessing the referencing item (for example, if customers are often accessed by country), then loading customers so that customers of the same country end up next to each other can result in great gains. The same applies to a time dimension, which in SQL is often done as a dimension table, but rarely so in linked data. Anyhow, if date is a frequent selection criterion, physically putting items in certain date ranges together can give great gains.

The trick here is not necessarily to index on date, but rather to use zone maps (aka min/max index). If nearby values are together, then just storing a min-max value for thousands of consecutive column values is very compact and fast to check, provided that the rows have nearby values. Actian Vector's (VectorWise) prowess in TPC-H is in part from smart use of date order in this style.
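A minimal sketch of the zone map idea, assuming a column held as a plain Python list and a fixed block size (both function names are made up for the example): each block of consecutive values keeps only its minimum and maximum, and a range query skips every block whose interval cannot overlap the range.

```python
def build_zone_map(values, block_size):
    """Return one (min, max) pair per block of consecutive column values."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def blocks_to_scan(zone_map, lo, hi):
    """Indexes of the blocks whose [min, max] overlaps the range [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(zone_map) if mx >= lo and mn <= hi]


# Values loaded in order: nearby values share a block.
clustered = list(range(1000))
# The same values in a scattered order: every block spans most of the domain.
scattered = [(i * 37) % 1000 for i in range(1000)]

clustered_hits = blocks_to_scan(build_zone_map(clustered, 100), 250, 260)
scattered_hits = blocks_to_scan(build_zone_map(scattered, 100), 250, 260)
print(len(clustered_hits), len(scattered_hits))
```

With the clustered order, one block out of ten has to be read; with the scattered order, the zone map excludes nothing. This is why load order matters more than the presence of an index here.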

To recap, the data shapes desiderata from the viewpoint of guiding physical storage are as follows:

(I will use "data shape" to mean "characteristic set," or "set of Subjects subject to the same set of constraints." A Subject's membership in a data shape may be determined either by its rdf:type or by its having, within the processing window, all or some of a set of properties.)

  • All normal range, domain, cardinality, optionality, etc. — Specifically, declaring something as single valued (as with SQL's UNIQUE constraint) and mandatory (as with SQL's NOT NULL constraint) is good.
  • Primary access path — Identify the Properties whose Objects are the dominant access criteria.
  • No-index — Declare that no index will be made on the Object of a Property within a data shape.
  • Inlined string — String values of up to so many characters in this data shape are inlined.
  • Clustering key — The Subject identifiers will be picked to be correlated with the Object of this Property in this data shape. This can be qualified by a number of buckets (e.g., if dates are from 2000 to 2020, then this interval may be 100 buckets), with an exception bucket for out of range values.
  • No full text index — A string value will not need to be full text indexed in this Property even if full text indexing is generally on.
  • Full text index desired — This means that if the value of the property is a string, then the row must be locatable via this string. The string may or may not be inlined, but an index will exist on the literal ID of the string, e.g., POSG.
  • Co-location — This is akin to clustering but specifies, for a high cardinality Object, that the Subject identifier should be picked to fall in the same partition as the Object. The Object is typically a parent of the Subject being loaded; for example, the containing assembly of a sub-assembly. Traversing the assembly created in this way will be local on a scale-out system. This can also apply to geometries or text values: If primary access is by text or geo index, then the metadata represented as triples should be in the same partition as the entry in the full text/geo index.
  • Update group — A set of properties that will often change together. Implies no index and some form of co-location, plus update-friendly physical representation. Many update groups may exist, in which case they may or may not be collocated.
  • Composite foreign/primary key — A data shape can have a multicolumn foreign key, e.g., l_partkey, l_suppkey in lineitem with the matching primary key of ps_partkey, ps_suppkey in partsupp. This can be used for checking and for query optimization: Looking at l_partkey and l_suppkey as independent properties, the guess would be that there hardly ever exists a partsupp, whereas one does always exist. The XML standards stack also has a notion of a composite key for random access on multiple attributes.

These things have the semantics of a "hint for physical storage" and may all be ignored without effect on the meaning of the data, at least if the data is constraint-compliant to start with.
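To make the list above more tangible, here is one way such hints might be written down for the lineitem shape. This is only a sketch: the class names, field names, the rdfh:shipDate property, and the bucketing arithmetic are all assumptions made for illustration and do not correspond to any existing vocabulary or to Virtuoso's implementation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ClusteringKey:
    """Clustering on a date-valued property, qualified by a bucket count."""
    prop: str
    low: date
    high: date
    buckets: int

    def bucket(self, value: date) -> int:
        """Bucket 0..buckets-1, or `buckets` as the exception bucket."""
        if not self.low <= value <= self.high:
            return self.buckets
        span = (self.high - self.low).days + 1
        return (value - self.low).days * self.buckets // span


@dataclass
class ShapeHints:
    """Physical storage hints for one data shape; all of them may be ignored."""
    shape: str
    primary_access: list = field(default_factory=list)     # dominant access paths
    no_index: list = field(default_factory=list)            # properties not indexed on O
    inline_string_max: dict = field(default_factory=dict)   # property -> max characters
    update_groups: list = field(default_factory=list)       # sets of co-updated properties
    composite_keys: list = field(default_factory=list)      # (properties, referenced shape)
    clustering: Optional[ClusteringKey] = None


lineitem_hints = ShapeHints(
    shape="rdfh:lineitem",
    primary_access=["rdfh:hasOrder"],
    composite_keys=[(("rdfh:hasPart", "rdfh:hasSupplier"), "rdfh:partsupp")],
    clustering=ClusteringKey("rdfh:shipDate", date(2000, 1, 1), date(2020, 12, 31), 100),
)
print(lineitem_hints.clustering.bucket(date(2010, 6, 15)))
```

A store that does not understand a hint can drop the whole record and still load and answer queries correctly, only more slowly.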

These things will have some degree of reference implementation through the evolution of Virtuoso structure awareness, though not necessarily immediately. These are, to the semanticist, surely dirty low-level disgraceful un-abstractions, some of the very abominations the early semanticists abhorred or were blissfully ignorant of when they first raised their revolutionary standard.

Still, these are well-established principles of the broader science of database. SQL does not standardize some of these, nor does it have much need to, as the use of these features is system-specific. The support varies widely and the performance impacts are diverse. However, since RDF excels as a reference model and as a data interchange format, giving these indications as hints to back-end systems cannot hurt, and can make a difference of night and day in load and query time.

As Phil Archer said, the idea of RDF Data Shapes is for an application to say that "it will barf if it gets data that is not like this." An extension is for the data to say what the intended usage pattern is so that the system may optimize for this.

All these things may be learned from static analysis and workload traces. The danger of this is over-fitting a particular profile. This enters a gray area in benchmarking. For big data, if RDF is to be used as the logical model and the race is about highest absolute performance, never mind what the physical model ends up being, all this and more is necessary. And if one is stretching the envelope for scale, the race is always about highest absolute performance. For this reason, these things will figure at the leading edge with or without standardization. I would say that the build-up of experience in the RDBMS world is sufficient for these things to be included as hints in a profile of data shapes. The compliance cost will be nil if these are ignored, so for the W3C, these will not make the implementation effort for compliance with an eventual data shapes recommendation prohibitive.

The use case is primarily the data warehouse to go. If many departments or organizations publish data for eventual use by their peers, users within the organization may compose different combinations of extractions for different purposes. Exhaustive indexing of everything by default makes the process slow and needlessly expensive, as we have seen. Much of such exploration is bounded by load time. Federated approaches for analytics are just not good, even though they may work for infrequent lookups. If datasets are a commodity to be plugged in and out, the load and query investment must be minimized without the user/DBA having to run workload analysis and manual schema optimization. Therefore, bundling guidelines such as these with data shapes in a dataset manifest can do no harm and can in some cases provide 10-50x gains in load speed and 2-4x gains in space consumption, not to mention unbounded gains in query time, as good and bad plans easily differ by 10-100x, especially in analytics.

So, here is the pitch:

  • Dramatic gains in ad hoc user experience
  • Minimal effort by data publishers, as many of the physical guidelines can be derived from the workload trace and the dataset; the point is that the ad hoc user does not have to do this.
  • Great optimization potential for system vendors; low cost for initial compliance
  • Better understanding of the science of performance by the semantic community

To be continued...
