At ESWC2008, we saw the Linked Open Data Cloud condense its first drops of precipitation.

voiD, Vocabulary of Interlinked Datasets, is an idea whose time has clearly come. By the end of the conference, many speakers had already adopted the meme.

The point is to describe what is inside the data sets. People who have worked with a set or put it together may know this, but to an outsider it is not evident.

The Semantic Sitemap says where the files or end points for access are, but it does not say what is inside them. For federation, too, it is important to be able to determine whether it makes sense to send a particular query to a particular end point.

If we play this right, this is what voiD will provide. I have to think of Dan Simmons' flamboyant Hyperion sci-fi series, where the "void which binds" was a sort of hyperspace that contained the thoughts of entities past and present, and even provided teleportation.

So what does the voiD hold, aside from infinite potentialities?

The obvious part is DC-like provenance, version, authorship, license, and similar data-set-wide information. The subject matter could also be classified by reference to UMBEL or the Yago classification of DBpedia.

More is needed, though. The simple part is listing the ontologies, if any. Listing the namespaces used would also be an idea, but such a list could become very large.
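
To make this concrete, a voiD description could look something along these lines. This is a sketch only: the void: namespace and its property names are assumptions, since the vocabulary itself is still taking shape, and the UMBEL concept URI is illustrative.

    @prefix void:    <http://rdfs.org/ns/void#> .    # assumed namespace, vocabulary still in flux
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix :        <http://example.org/void/> .    # hypothetical

    :DBpedia a void:Dataset ;                        # hypothetical class
        dcterms:title     "DBpedia" ;
        dcterms:publisher <http://dbpedia.org> ;
        dcterms:license   <http://creativecommons.org/licenses/by-sa/3.0/> ;
        dcterms:subject   <http://umbel.org/umbel/sc/PopulatedPlace> ;   # subject via UMBEL, illustrative URI
        void:vocabulary   <http://xmlns.com/foaf/0.1/> ,                 # hypothetical property: ontologies used
                          <http://www.w3.org/2004/02/skos/core#> .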

So let us look at what we'd like to be able to answer with the voiD set.

The following could serve as a sample of voiD questions:

  • What subjects are in the LOD cloud?

  • Given this URI, which set in the LOD cloud can tell me more? This breaks down into asking a text index like Sindice for the location, getting the namespace or data set, and then querying voiD; see the query sketch after this group of questions.

  • What do I need to federate or load in order to combine everything reachable from a given vocabulary? There could, for example, be a graph showing the data sets and the edges between them, each edge qualified by a set of sameAs assertions, itself a voiD-described set, where translations are needed.

  • What sets are from the same or an equally trusted publisher as this one?
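
For the URI question above, a lookup against a collection of voiD descriptions might look roughly like the following. Again a sketch: the void:Dataset class and a void:uriSpace property giving the URI prefix a set mints are assumptions about where the vocabulary could go, and strStarts is a newer SPARQL built-in.

    PREFIX void: <http://rdfs.org/ns/void#>    # assumed namespace

    SELECT ?dataset
    WHERE
      {
        ?dataset a void:Dataset ;              # hypothetical class
                 void:uriSpace ?space .        # hypothetical property: URI prefix the set mints
        FILTER ( strStarts ( "http://dbpedia.org/resource/Berlin", ?space ) )
      }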

These questions divide roughly into descriptions of the set itself and then some details on how it is stored at a given end point.

  • Given this set, in which other sets will I find use of the same URIs? For example, if I have language version x, I wish to know that language version y will use the same URIs insofar as the things meant are the same.

  • Given this set, which sets of sameAs assertions do I have for mapping to which other sets? For example, if I have Geonames, I wish to know that set x will map at least some of the URIs in Geonames to DBpedia URIs; a sketch of such a mapping set follows the next paragraph.

Let me further point out that it is increasingly clear to the community that universal sameAs is dubious; hence sameAs assertions ought to be kept separate and included or excluded depending on the usage context.
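
A separately kept mapping set could then itself get a voiD description along these lines; the Linkset class and the target properties are, once more, assumptions about where the vocabulary could go.

    @prefix void: <http://rdfs.org/ns/void#> .   # assumed namespace
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix :     <http://example.org/void/> .   # hypothetical

    :Geonames2DBpedia a void:Linkset ;           # hypothetical class for a set of links
        void:subjectsTarget :Geonames ;          # maps URIs of this set ...
        void:objectsTarget  :DBpedia ;           # ... to URIs of this one
        void:linkPredicate  owl:sameAs .         # kept apart, so usable or ignorable per context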

  • Given this set, what are the interesting queries I can run? This is a sort of advertisement for human consumption, not a list of queries for crashing the end point. Denial of service can be done in SPARQL without knowing the end point's content anyhow, so this is no added exposure to risk.

  • Vocabularies used. This is a reference to the OWL or RDFS resources giving the applicable ontologies, if present. Also, a complete list of the classes whose direct instances actually occur in the set is useful; see the first sketch after this list for how it can be computed.

  • Ballpark cardinality. Something like a DARQ optimization profile would be a good idea. I would say there should be a possibility of simply including a DARQ description file as is. This is a sort of baseline, and since it already exists, we are spared the committee trouble of figuring out what it ought to contain and what not; if we start defining this from scratch, it will take a long time. Further, let this be optional. Quite independently of this, query processors may make optimization-related queries to remote end points insofar as the specific end point supports these. This will come in time; for now, just the basics. A placeholder sketch of such statistics is the second one after this list.
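
The class list mentioned above need not be maintained by hand; against most end points it can be computed with a query as simple as the following, though on a large set this is not a cheap query.

    SELECT DISTINCT ?class
    WHERE
      {
        ?s a ?class .    # every class with at least one direct instance
      }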
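
As for ballpark cardinality, whether this ends up reusing the DARQ description format as is or properties along the following lines is exactly the open question; the property names and figures below are placeholders, not a proposal.

    @prefix void: <http://rdfs.org/ns/void#> .   # assumed namespace
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix :     <http://example.org/void/> .   # hypothetical

    :DBpedia void:triples 274000000 ;            # hypothetical ballpark figure
        void:classPartition                      # hypothetical: per-class statistics
          [ void:class    foaf:Person ;
            void:entities 300000 ] .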

Along with this, LOD SPARQL end points could adopt a couple of basic conventions. The simplest would be to agree that each hosts a graph with a given URI containing the voiD descriptions of the data sets it serves, along with the graph URI used for each set if that differs from the publisher's URI for the graph. There is a point to this, since an end point may load multiple data sets into one graph.
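
Under such a convention, a client could ask any participating end point what it holds with a fixed query. The graph URI and the property tying a set to its local graph are both placeholders here.

    PREFIX void: <http://rdfs.org/ns/void#>      # assumed namespace
    PREFIX ex:   <http://example.org/ns#>        # hypothetical helper namespace

    SELECT ?dataset ?graph
    WHERE
      {
        # <http://the-endpoint.example/void> stands in for whatever well-known graph URI gets agreed on
        GRAPH <http://the-endpoint.example/void>
          {
            ?dataset a void:Dataset .
            OPTIONAL { ?dataset ex:inGraph ?graph }   # hypothetical property naming the local graph
          }
      }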

We hope to have a good idea of the matter shortly, and certainly a general statement of direction to publish at Linked Data Planet in a couple of weeks.