(Second of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There was a workshop on semantic search, plus a number of papers and, of course, keynotes from Google and Yahoo.
A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft), or should they be made more generally available? This is a privacy question. The use of query logs and click-through on search results for improved ranking was mentioned many times throughout the conference.
The semantic search workshop was largely about benchmarks for keyword search in information retrieval. For linked data, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.
Giovanni Tummarello presented Sig.ma, a service that uses Sindice's RDF index to collect all RDF statements about entities matching a set of keywords. One could then choose which sources and which entities were the right ones, store such a query, and embed it on a page. The point was that filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Speculating further, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources in our browser, too. Fresnel lenses are another thing to look at.
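To make the aggregation pattern concrete, here is a minimal sketch in Python using rdflib. The source URLs and entity URI are hypothetical placeholders, and Sig.ma itself works off Sindice's index with its own ranking and UI; this only illustrates the merge-and-filter step.

```python
# Minimal sketch of Sig.ma-style aggregation: merge RDF statements about an
# entity from a hand-picked set of live sources, with per-source exclusion.
# The source URLs below are hypothetical placeholders; Sig.ma resolves sources
# via Sindice's index rather than from a static list.
from rdflib import Graph, URIRef

SOURCES = [
    "http://example.org/people/alice.rdf",   # hypothetical RDF documents
    "http://example.net/foaf/alice.ttl",
]

def aggregate(entity_uri, sources, excluded=frozenset()):
    """Collect statements whose subject is entity_uri, keyed by source."""
    merged = {}
    for src in sources:
        if src in excluded:
            continue                          # manual source filtering
        g = Graph()
        try:
            g.parse(src)                      # rdflib tries to guess the syntax
        except Exception:
            continue                          # skip unreachable or broken sources
        triples = list(g.triples((URIRef(entity_uri), None, None)))
        if triples:
            merged[src] = triples
    return merged

# The returned {source: [(s, p, o), ...]} dict is what a UI could render,
# letting the user drop sources and persist the remaining selection.
view = aggregate("http://example.org/people/alice#me", SOURCES,
                 excluded={"http://example.net/foaf/alice.ttl"})
```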
There was a paper by Josep M. Pujol and Pablo Rodriguez of Telefonica Research about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they are retrieved. For users with similar interests, this can give a community-based ranking model, but it has privacy issues. Another point was that with local processing and personal-scale data volumes, various kinds of brute-force processing become feasible that would cost a lot at web scale. Much can be done at web scale, but it must be done cleverly, not with a shell script and not so ad hoc.
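For illustration, here is a rough Python sketch of the local side of such a scheme: peers push (query, URL) result pairs to each other, and each node keeps a small inverted index plus a record of how many distinct peers have shared a URL, which gives a crude community ranking. The data structures and peer IDs are assumptions made up for the example, not Porqpine's actual design.

```python
# Rough sketch of locally indexing search results shared by peers. A URL ranks
# higher the more distinct peers have shared it for matching queries. This is
# an illustration of the idea, not Porqpine's data structures or protocol.
from collections import defaultdict

class LocalResultIndex:
    def __init__(self):
        self.postings = defaultdict(set)      # term -> set of urls
        self.endorsers = defaultdict(set)     # url  -> set of peer ids

    def add_shared_result(self, peer_id, query, url):
        for term in query.lower().split():
            self.postings[term].add(url)
        self.endorsers[url].add(peer_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        candidates = set.intersection(*(self.postings[t] for t in terms))
        # community ranking: URLs endorsed by more peers come first
        return sorted(candidates, key=lambda u: -len(self.endorsers[u]))

idx = LocalResultIndex()
idx.add_shared_result("peer-a", "rdf stores", "http://example.org/virtuoso")
idx.add_shared_result("peer-b", "rdf stores", "http://example.org/virtuoso")
idx.add_shared_result("peer-b", "rdf stores", "http://example.org/other")
print(idx.search("rdf stores"))   # the twice-shared URL ranks first
```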
As a counterpoint to this, there was a talk about Hadoop and Hive, a map-reduce-based, SQL-like framework. One could do a SQL GROUP BY on text files, with record parsing at run time, all spread over a Hadoop cluster. The issue is that if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing become important. This cannot be done without indices and complex query optimization; in other words, it needs a DBMS. Stonebraker and company are fully justified in their critique of map-reduce. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.
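For comparison, the kind of GROUP BY that Hive compiles into map-reduce jobs can be sketched directly as a pair of Hadoop Streaming-style Python scripts. The tab-delimited input format and field positions below are assumptions made up for the example; the point is only what run-time record parsing and a shuffle-sorted GROUP BY look like at this level.

```python
# A rough Hadoop Streaming-style equivalent of
#   SELECT url, COUNT(*) FROM logs GROUP BY url
# over tab-delimited text files parsed at run time. The field layout
# (url in column 3) is an assumption made up for this example.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")  # parse the record at run time
        if len(fields) > 2:
            print("%s\t1" % fields[2])          # emit (url, 1)

def reducer():
    current, count = None, 0
    for line in sys.stdin:                      # mapper output arrives sorted by key
        key, value = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            print("%s\t%d" % (current, count))  # close out the previous group
            count = 0
        current = key
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```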
Some of our future plans were confirmed by what we saw, for example concerning:
- Interactively selecting sources for search, showing the graphs, then interactively refining
- More social networks, more network analysis, and more work on social recommendation
- Real-time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting microformats from results
- Using entity extraction
These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts we will partner with others; the details will be worked out while we work on the items we implement ourselves.