Linked Data and Information Architecture

We had a workshop on Linked Open Data (LOD) last week in Beijing. You can see the papers in the program. The event was a success with plenty of good talks and animated conversation. I will not go into every paper here but will comment a little on the conversation and draw some technology requirements going forward.

Tim Berners-Lee showed a read-write version of Tabulator. This raises the question of updating on the Data Web. The consensus was that one could assert what one wanted in one's own space but that others' spaces would be read-only. What spaces one considered relevant would be the user's or developer's business, as in the document web.

It seems to me that a significant use case of LOD is an open-web situation where the user picks a broad read-only "data wallpaper" or backdrop of assertions, and then uses this combined with a much smaller, local, writable data set. This is certainly the case when editing data for publishing, as in Tim's demo. This will also be the case when developing mesh-ups combining multiple distinct data sets bound together by sets of SameAs assertions, for example. Questions like, "What is the minimum subset of n data sets needed for deriving the result?" will be common. This will also be the case in applications using proprietary data combined with open data.

This means that databases will have to deal with queries whose scope is a large list of included graphs, every graph in the store, or every graph except those on an exclusion list. All this is quite possible but again should be considered when architecting systems for an open linked data web.
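To make the three kinds of scope concrete, here is a minimal in-memory sketch in Python. It is not how Virtuoso or any other store implements graph selection; the `Quad` type, the `scoped` function, and all graph names are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Set


class Quad(NamedTuple):
    s: str
    p: str
    o: str
    g: str  # name of the graph the triple was asserted in


def scoped(quads: Iterable[Quad],
           include: Optional[Set[str]] = None,
           exclude: Set[str] = frozenset()) -> Iterator[Quad]:
    """Yield the quads that fall inside the requested scope.

    include=None means "all graphs in the store"; exclude is subtracted
    from whatever include selects.
    """
    for q in quads:
        if include is not None and q.g not in include:
            continue
        if q.g in exclude:
            continue
        yield q


# Hypothetical data: a read-only backdrop, a local writable graph, and junk.
STORE = [
    Quad("ex:Beijing", "ex:label", "Beijing", "http://backdrop.example/geo"),
    Quad("ex:Beijing", "ex:note", "visited for the workshop", "http://me.example/notes"),
    Quad("ex:Beijing", "ex:label", "buy cheap watches", "http://spam.example/junk"),
]

everything_but_junk = list(scoped(STORE, exclude={"http://spam.example/junk"}))
only_mine = list(scoped(STORE, include={"http://me.example/notes"}))
```

A linear scan like this obviously does not scale. The point is only that the scope is a per-query parameter, so a store's indexing has to cope with include and exclude sets that change from one query to the next.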

"There is data but what can we really do with it? How far can we trust it, and what can we confidently decide based on it?"

As an answer to this question, Zitgist has compiled the UMBEL taxonomy using SKOS. This draws on Wikipedia, OpenCyc, WordNet, and YAGO, hence the acronym WOWY. UMBEL is both a taxonomy and a set of instance data, containing a large set of named entities, including persons, organizations, geopolitical entities, and so forth. By extracting references to these named entities from documents and correlating them with the taxonomy, one gets a good idea of what a document (or part thereof) is about.

Kingsley presented this in the Zitgist demo. This is our answer to the criticism about DBpedia having errors in classification. DBpedia, as a bootstrap stage, is about giving names to all things. Subsequent efforts like UMBEL are about refining the relationships.

"Should there be a global URI dictionary?"

There was a talk by Paolo Bouquet about the Entity Name System, a sort of data DNS, with the purpose of associating some description and rough classification to URIs. This would allow discovering URIs for reuse. I'd say that this is good if it can cut down on the SameAs proliferation and if this can be widely distributed and replicated for resilience, à la DNS. On the other hand, it was pointed out that this was not quite in the LOD spirit, where parties would mint their own dereferenceable URIs, in their own domains. We'll see.

"What to do when identity expires?"

Giovanni of Sindice said that a document should be removed from search if it was no longer available. Kingsley pointed out that resilience of reference requires some way to recover data. The data web cannot be less resilient than the document web, and there is a point to having access to history. He recommended hooking up with the Internet Archive, since they make long-term persistence their business. In this way, if an application depends on data, and the URIs on which it depends are no longer dereferenceable or provide content from a new owner of the domain, those who need the old version can still get it and host it themselves.

It is increasingly clear that OWL SameAs is both the blessing and bane of linked data. We can easily have tens of URIs for the same thing, especially with people. Still, these should be considered the same.

Returning every synonym in a query answer hardly makes sense but accepting them as input seems almost necessary. This is what we do with Virtuoso's SameAs support. Even so, this can easily double query times even when there are no synonyms.
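As an illustration of accepting any synonym as input while answering under a single name, here is a minimal Python sketch built on a union-find structure. This is not how Virtuoso's SameAs support works internally; `SameAsIndex`, `lookup`, and the URIs are hypothetical names chosen for the example.

```python
class SameAsIndex:
    """Union-find over URIs; each equivalence class has one canonical URI."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the way directly at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def add(self, a, b):
        """Record that a and b name the same thing."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            # Keep the lexicographically smaller root so the choice is stable.
            if rb < ra:
                ra, rb = rb, ra
            self._parent[rb] = ra

    def canonical(self, uri):
        return self._find(uri)


def lookup(facts, index, uri):
    """Return (predicate, object) pairs for uri or any of its synonyms."""
    target = index.canonical(uri)
    return [(p, o) for s, p, o in facts if index.canonical(s) == target]


index = SameAsIndex()
index.add("http://a.example/Berlin", "http://b.example/city/42")

FACTS = [
    ("http://a.example/Berlin", "ex:country", "ex:Germany"),
    ("http://b.example/city/42", "ex:timezone", "Europe/Berlin"),
]

via_a = lookup(FACTS, index, "http://a.example/Berlin")
via_b = lookup(FACTS, index, "http://b.example/city/42")
```

Both lookups return the same rows, and no synonym appears in the answer. Note that even a subject with no synonyms still pays for a `canonical` call, which is the same kind of overhead mentioned above.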

Be that as it may, SameAs is here to stay; just consider the mapping of DBpedia to Geonames, for example.

Also, making aberrant SameAs statements can completely poison a data set and lead to absurd query results. Hence choosing which SameAs assertions from which source will be considered seems necessary. In an open web scenario, this leads inevitably to multi-graph queries that can be complex to write with regular SPARQL. By extension, it seems that a good query would also include the graphs actually used for deriving each result row. This is of course possible but has some implications on how databases should be organized.
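The following Python sketch shows both ideas together: SameAs statements are followed only when they come from graphs the user trusts, and every result row carries the set of graphs that were used to derive it. It is a toy over an in-memory list, with hypothetical graph names and data, and it filters only the SameAs sources, not the data graphs.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    s: str
    p: str
    o: str
    g: str  # graph the statement came from


SAME_AS = "owl:sameAs"


def synonyms(quads, uri, trusted):
    """Map each URI reachable via trusted sameAs links to the graphs used."""
    seen = {uri: frozenset()}
    stack = [uri]
    while stack:
        cur = stack.pop()
        for q in quads:
            if q.p != SAME_AS or q.g not in trusted:
                continue
            if q.s == cur:
                nxt = q.o
            elif q.o == cur:
                nxt = q.s
            else:
                continue
            if nxt not in seen:
                seen[nxt] = seen[cur] | {q.g}
                stack.append(nxt)
    return seen


def describe(quads, uri, trusted):
    """Rows of (predicate, object, graphs used to derive the row)."""
    syn = synonyms(quads, uri, trusted)
    return [(q.p, q.o, syn[q.s] | {q.g})
            for q in quads if q.p != SAME_AS and q.s in syn]


QUADS = [
    Quad("ex:Paris", "ex:capitalOf", "ex:France", "g:encyclopedia"),
    Quad("geo:paris", "ex:latitude", "48.86", "g:gazetteer"),
    Quad("ex:Paris_of_Troy", "ex:father", "ex:Priam", "g:myth"),
    Quad("ex:Paris", SAME_AS, "geo:paris", "g:mapping"),
    Quad("ex:Paris", SAME_AS, "ex:Paris_of_Troy", "g:careless"),
]

careful = describe(QUADS, "ex:Paris", trusted={"g:mapping"})
careless = describe(QUADS, "ex:Paris", trusted={"g:mapping", "g:careless"})
```

With the careless graph trusted, the city acquires a father. The provenance column makes it easy to see which graph is to blame and to drop it from the next query.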

Yves Raimond gave a talk about deriving identity between Musicbrainz and Jamendo. I see the issue as a core question of linked data in general. The algorithm Yves presented started with attribute value similarities and then followed related entities. Artists would be the same if they had similar names and similar names of albums with similar song titles, for example. We can find the same basic question in any analysis, for example, looking at how news reporting differs between media, supposing there is adequate entity extraction.
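To give a feel for this kind of matching, here is a small Python sketch that scores two artist records by name similarity and then by the similarity of their albums and tracks. It is not the algorithm from the talk, only an illustration of the general idea; the record layout, the equal weights, and the sample data are all assumptions.

```python
from difflib import SequenceMatcher


def similar(a, b):
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(items_a, items_b, score):
    """Average, over items_a, of each item's best score against items_b."""
    if not items_a or not items_b:
        return 0.0
    return sum(max(score(a, b) for b in items_b) for a in items_a) / len(items_a)


def album_score(a, b):
    # Half from the album title, half from how well the track lists line up.
    return (0.5 * similar(a["title"], b["title"])
            + 0.5 * best_match(a["tracks"], b["tracks"], similar))


def artist_score(a, b):
    # Half from the artist name, half from following the related albums.
    return (0.5 * similar(a["name"], b["name"])
            + 0.5 * best_match(a["albums"], b["albums"], album_score))


# Hypothetical records, as they might look in two different data sets.
A = {"name": "The Example Band",
     "albums": [{"title": "First Light", "tracks": ["Dawn", "Noon"]}]}
B = {"name": "Example Band, The",
     "albums": [{"title": "First light", "tracks": ["Dawn", "Noon (live)"]}]}
C = {"name": "The Example Band",
     "albums": [{"title": "Zzz", "tracks": ["Qqq"]}]}

score_ab = artist_score(A, B)  # differently spelled name, matching albums
score_ac = artist_score(A, C)  # identical name, unrelated albums
```

Here B outscores C even though C has the exact name, because the related entities agree. A real system would still need a threshold and some way to review borderline cases before asserting SameAs.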

There is basic graph diffing in RDFSync, for example. But here we are expanding the context significantly. We will traverse references to some depth, allow similarity matches, SameAs, and so forth. Having presumed identity of two URIs, we can then look at the difference in their environment to produce a human readable summary. This could then be evaluated for purposes of analysis or of combining content.
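A depth-one version of such a comparison can be sketched in a few lines of Python. This is not RDFSync; it only compares the immediate predicate and object pairs of two URIs that are presumed identical, with made-up data, and leaves out deeper traversal and similarity matching.

```python
def environment(triples, uri):
    """Immediate (predicate, object) pairs asserted about uri."""
    return {(p, o) for s, p, o in triples if s == uri}


def diff_environments(triples_a, uri_a, triples_b, uri_b):
    """Compare what two sources say about URIs presumed to be the same."""
    env_a = environment(triples_a, uri_a)
    env_b = environment(triples_b, uri_b)
    return {
        "shared": env_a & env_b,
        "only_a": env_a - env_b,
        "only_b": env_b - env_a,
    }


def summarize(diff):
    """Render the diff as sorted, human-readable lines."""
    lines = []
    for key in ("shared", "only_a", "only_b"):
        for p, o in sorted(diff[key]):
            lines.append(f"{key}: {p} {o}")
    return "\n".join(lines)


# Hypothetical statements from two sources about the same artist.
SOURCE_A = [
    ("a:artist1", "ex:name", "Example Band"),
    ("a:artist1", "ex:country", "FR"),
]
SOURCE_B = [
    ("b:x9", "ex:name", "Example Band"),
    ("b:x9", "ex:country", "BE"),
]

diff = diff_environments(SOURCE_A, "a:artist1", SOURCE_B, "b:x9")
report = summarize(diff)
```

The two sources agree on the name and disagree on the country, which is exactly the kind of discrepancy a human reader would want surfaced.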

At first sight, these algorithms seem well parallelizable, as long as all threads have access to all data. For scaling, this probably means a message-bound distributed algorithm. This is something to look into for the next stage of linked data.

Some inference is needed, but if everybody has their own choice of data sets to query, then everybody would also have their own entailed triples. This will make for an explosion of entailed graphs if forward chaining is used. Forward chaining is very nice because it keeps queries simple and easy to optimize. With Virtuoso, we still favor backward chaining since we expect a great diversity of graph combinations and near infinite volume in the open web scenario. With private repositories of slowly changing data put together for a special application, the situation is different.
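The difference can be shown with a toy Python example of backward chaining over subclass statements. Nothing is materialized; the class hierarchy is walked at query time, using only the graphs chosen for that query. This is a sketch of the principle, not of Virtuoso's inference engine, and the graph and class names are hypothetical.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"


def subclasses(cls, quads, graphs):
    """cls plus every class that is transitively a subclass of it,
    according to subClassOf statements in the selected graphs only."""
    found = {cls}
    frontier = [cls]
    while frontier:
        cur = frontier.pop()
        for s, p, o, g in quads:
            if p == SUBCLASS_OF and o == cur and g in graphs and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def instances_of(cls, quads, graphs):
    """Backward chaining: expand the class at query time, then match."""
    classes = subclasses(cls, quads, graphs)
    return {s for s, p, o, g in quads
            if p == TYPE and o in classes and g in graphs}


# Hypothetical store: schema and data live in separate graphs.
QUADS = [
    ("ex:Musician", SUBCLASS_OF, "ex:Person", "g:schema"),
    ("ex:alice", TYPE, "ex:Musician", "g:data"),
]

with_schema = instances_of("ex:Person", QUADS, {"g:schema", "g:data"})
without_schema = instances_of("ex:Person", QUADS, {"g:data"})
```

The entailment that ex:alice is a person exists only for users who include the schema graph. With forward chaining, each such combination of graphs would need its own set of stored entailed triples.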

In conclusion, we have a real LOD movement with actual momentum and a good idea of what to do next. The next step is promoting this to the broader community, starting with Linked Data Planet in New York in June.

Orri Erling's Weblog
