"I give the search keywords and you give me a SPARQL end-point and a query that will get the data."

Thus did one SPARQL user describe the task of a semantic/data web search engine.

In a previous post, I suggested that if the data web were the size of the document web, we'd be looking at two orders of magnitude more search complexity. It just might be so.

In the conversation, I pointed out that a search engine might hold a copy of everything and even be able to run SPARQL and full text over it all, yet users would often still be better off running their queries against the data publishers' own SPARQL end-points. It is a bit like the fact that not all web browsing runs off Google's cache. With the data web, the point is even more pronounced: serving a hit from Google's cache is a small operation, whereas evaluating a complex query can be a very large one.

Yet the data web is about ad-hoc joining between data sets of different origins. Thus a data web search engine ought itself to be capable of joining, even if large queries ought to be run against individual publishers' end-points or the user's own data warehouse.
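To make this concrete, here is a minimal sketch of such a cross-source join, assuming the engine keeps each indexed data set under its own named graph; the graph IRIs below are purely illustrative:

    # Hypothetical: join a person described in one data set with the papers
    # attributed to that person in another, both held under named graphs.
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>

    SELECT ?person ?paper
    WHERE
      {
        GRAPH <http://example.org/graph/people>
          { ?person foaf:name "Chris Bizer" . }
        GRAPH <http://example.org/graph/publications>
          { ?paper dc:creator ?person . }
      }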

For ranking, the general consensus was that no single hit-ranking scheme will suit the whole data web. Word-frequency-based scores are fine for text hits, but beyond that little is obvious. I would think some form of link analysis could apply, but this will take more experimentation.

For search summaries, if data sets are split into small fragments à la Sindice, summaries work out much the same as in plain text search. If we store triples, we can give text-style summaries of hits in literals plus Fresnel lens views of the structured data around the literal. For showing a page of hits, the lenses must abbreviate heavily, but this is still feasible. The engine would know about the most common ontologies and summarize instance data accordingly.
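As a rough sketch of the structured side of such a summary: once the full-text match has identified the subject whose literal matched (the resource IRI below is a placeholder), the engine could pull a small neighborhood of triples around it as raw material for an abbreviating lens:

    # Illustrative only: the immediate description of the matched subject
    # plus one further hop, to be abbreviated by a lens for display.
    CONSTRUCT
      {
        <http://example.org/resource/matched-subject> ?p ?o .
        ?o ?p2 ?o2 .
      }
    WHERE
      {
        <http://example.org/resource/matched-subject> ?p ?o .
        OPTIONAL { ?o ?p2 ?o2 }
      }
    LIMIT 100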

Chris Bizer pointed out that trust and provenance are critical, especially if an answer is arrived at by joining multiple data sets. The trust of the conclusion is no greater than that of the weakest participating document. Different users will have different trusted sources.

A mature data web search engine would combine a provenance/trust specification, a search condition consisting of SPARQL or full text or both, and a specification for hit rank. Again, most searches would use defaults, but these three components should in principle be orthogonally specifiable.
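As a sketch of how the three parts could come together in a single query, assuming Virtuoso's full-text extension (bif:contains with a score option; other engines spell full text differently) and made-up graph IRIs standing for the user's trusted sources:

    # 1. Trust/provenance: FROM clauses restrict evaluation to trusted graphs.
    # 2. Search condition: a structured pattern plus a full-text condition.
    # 3. Hit rank: ORDER BY over the text-match score.
    # bif:contains and OPTION (score ...) are Virtuoso-specific extensions;
    # the bif: prefix is predefined there.
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?person ?bio ?sc
    FROM <http://example.org/trusted/source-a>
    FROM <http://example.org/trusted/source-b>
    WHERE
      {
        ?person foaf:name ?name .
        ?person <http://example.org/vocab/bio> ?bio .
        ?bio bif:contains "protein" OPTION (score ?sc) .
      }
    ORDER BY DESC (?sc)
    LIMIT 20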

Many places may host the same data set, either for download or for SPARQL access. The URI of the data set is not its URL, and one end-point may further host multiple data sets. Thus the search engine ought to return all end-points where a given set is to be found. The end-points themselves ought to be able to say which data sets they contain and under which graph IRIs. Since there is no consensus on end-point self-description, this too would be left to the search engine; in practice, it could be accomplished by extending Sindice's semantic site map specification. A possible query would be to find an end-point containing a given set of named data sets. If none were found, the search engine itself could run a query joining all the sets, since it at least would hold them all.
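A sketch of what such a lookup could look like, against a hypothetical registry graph in which each end-point describes the data sets it hosts; the ex: vocabulary and the data set URIs are invented for illustration, since nothing standard exists yet:

    # Find end-points that host both data sets of interest.
    PREFIX ex: <http://example.org/endpoint-description#>

    SELECT ?endpoint
    WHERE
      {
        ?endpoint ex:hostsDataSet <http://example.org/dataset/uniprot> .
        ?endpoint ex:hostsDataSet <http://example.org/dataset/wordnet> .
      }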

Since many places will host sets like Wordnet or Uniprot, indexing these once for each copy hardly makes sense. Thus a site should identify its data by the data set's URI and not the copy's URL.

It came up in the discussion that search engines should share a ping format, so that a single message format would suffice to notify any engine that data has been updated. This is already partly the case, with Sindice and PTSW (PingTheSemanticWeb) sharing a ping format.

Further, since it is no trouble to publish a copy of the 45G Uniprot file but a fair amount of work to index it, search engines should be careful about honoring requests to index things; such requests can otherwise amount to a denial-of-service attack.

Probably very large data sets should be indexed only in the form supplied by their publisher, with others hosting copies simply stating that they hold a copy. If the claim of holding a copy proved false, users could complain and the search engine administrator would remove the listing. It seems some manual curation cannot be avoided here.

On the Data Web Search Business Model

It seems there can be an overlap between the data web search and the data web hosting businesses. For example, Talis rents space for hosting RDF data with SPARQL access. A search engine should offer basic indexing of everything for free, but could charge either data publishers or end users for running SPARQL queries across data sets. Such queries do not have the nicely predictable and fairly uniform resource consumption of text lookups. In this manner, a search provider could cost-justify the capacity for allowing arbitrary queries.

The value of the data web lies in unexpected joining. Such joining takes place most efficiently if the sources are at least in some proximity, for example in the same data center. Thus the search provider could monetize functioning as the database provider for mesh-ups. In the document web, publishing pages is very simple and there is no great benefit from co-locating search and pages, rather the opposite. For the data web, hosting with SPARQL and all is more complex and resembles providing search. Thus providing search can combine with providing SPARQL hosting, once we accept in principle that search should allow arbitrary inter-document joining, even if at an extra premium.

The present search business model is advertising. If the data web is to be accessed by automated agents such as mesh-up code, display of ads is not self-evident. This is quite separate from the fact that semantics can lead to better ad targeting.

One model would be to offer text lookups for free from a regular web page but show ads, à la Google search ads. Using the service via web services, whether for text or SPARQL, would carry a cost paid by the searching or publishing party and would not be financed by advertising.

In the case of data used in value-added data products (mesh-ups) that have financial value to their users, the original publisher of the data could even be paid for keeping the data up to date. This would hold for any time-sensitive feed, such as news or financial data. Thus the hosting/search provider would be a broker of data-use fees, and the data producer would be in the position of an AdSense inventory owner, i.e., a web site that shows AdSense ads. Organizing this under a hub providing back-office functions, much like an ad network, could make sense even if the actual processing were divided among many sites.

Kingsley has repeatedly formulated the core value proposition of the semantic web in terms of dealing with information overload: there is the real-time enterprise and the real-time individual, and both are beasts of perception. Their image is won and lost in the Internet's online conversation space. We know that allegations, even if later proven false, will stick if left unchallenged. The function of semantics on the web is to allow one to track and manage where one stands. In fact, Garlik has made a business of just this, though from a privacy and security angle: the Garlik DataPatrol harvests data from diverse sources and lets one assess one's vulnerability to identity theft, for example.

If one is in the business of collating all the structured data in the world, as a data web search engine is, then providing custom alerts for both security and public-image management is quite natural. This can be a very valuable service if it works well.

At OpenLink, we will now experiment with the Sindice/Zitgist/PingTheSemanticWeb content. This is a regular part of the productization of Virtuoso's cluster edition. We expect to release some results in the next 4 weeks.