These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front.

RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.

For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level.

If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph.

On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.

One could write queries like —

SELECT  ?pub 
  WHERE 
    { 
      GRAPH  ?g 
        { 
          ?person  foaf:knows  ?contact 
        } 
      ?contact  foaf:name         "Alice"  . 
      ?g        xx:has_publisher  ?pub 
    }

This would return the publishers of graphs that assert that somebody knows Alice.

Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.

If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.

Virtuoso's RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O.

Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.

In SPARQL, one could use an extension syntax like —

SELECT  * 
  WHERE 
    { ?person      foaf:knows  ?connection 
                   OPTION ( time  ?ts )     . 
      ?connection  foaf:name   "Alice"      . 
      FILTER ( ?ts > "2009-08-08"^^xsd:datetime ) 
    }

This would return everybody who knows Alice since a date more recent than 2009-08-08. This presupposes that the quad table has been extended with a datetime column.

The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause.

If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like —

SELECT * 
  WHERE 
    { ?person   foaf:knows         ?contact 
                OPTION ( reify  ?r )          . 
      ?r        xx:assertion_time  ?ts       . 
      ?contact  foaf:name          "Alice"   . 
      FILTER ( ?ts > "2008-8-8"^^xsd:datetime ) 
    }

We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don't think there would be huge demand. Who knows? You tell us.