Why not see the whole world of data as facets? Well, we'd like to, but there is the feeling that this is not practical.
The old problem has been that it is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting, and that computing the exact matches at query time is slow as well.
Well, neither is in fact necessary. When there are large numbers of items matching the conditions, counting them can take time but then this is the beginning of the search, and the user is not even likely to look very closely at the counts. It is enough to see that there are many of one and few of another. If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed. The faceted view for guiding search and precise analytics are two different problems.
There are client-side faceted views like Exhibit or our own ODE. The problem with these is that the actual database is a few orders of magnitude larger than what fits on the user agent. This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible, but we are talking about the data web, the very soul of serendipity and unexpected discovery.
So we made a web service that will do faceted search on arbitrary RDF. If it does not get complete results within a timeout, it will return what it has counted so far, using Virtuoso's Anytime feature. Looking only for subjects with some specific combination of properties is, however, a bit limited, so the service also does JOINs. Many features are one or two JOINs away; take geographical locations or social networks, for example.
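The idea of returning whatever has been counted when the timeout expires can be sketched in a few lines of Python. This is only an illustration of the principle, not how Virtuoso's Anytime feature is implemented; the function name `anytime_facet_counts`, its parameters, and the injectable `clock` are all made up for the example.

```python
import time
from collections import Counter
from typing import Callable, Hashable, Iterable, Tuple


def anytime_facet_counts(
    items: Iterable,
    facet_of: Callable[[object], Hashable],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Counter, bool]:
    """Count items per facet value, giving up when the time budget runs out.

    Returns (counts, complete). `complete` is False when the deadline was
    reached before all items were seen, so the counts are lower bounds.
    Hypothetical helper for illustration only.
    """
    deadline = clock() + timeout_s
    counts: Counter = Counter()
    for item in items:
        # Check the budget before doing more work on the next item.
        if clock() >= deadline:
            return counts, False
        counts[facet_of(item)] += 1
    return counts, True


if __name__ == "__main__":
    classes = ["Person", "Document", "Person", "Place"]
    counts, complete = anytime_facet_counts(classes, lambda c: c, timeout_s=1.0)
    print(counts.most_common(), "complete" if complete else "partial")
```

Partial counts are good enough for the purpose described above: they still show that there are many of one thing and few of another, and the `complete` flag tells the caller not to read them as exact.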
Yet a faceted search should be point-and-click, and should not involve a full query construction. We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise. At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style.
This is good enough for queries like 'what do Harry Potter fans also like' or 'who are the authors of articles tagged semantic web and machine learning and published in 2008'. For complex grouping, sub-queries, arithmetic or such, one must write the actual query.
But one can begin with facets, and then continue refining the query by hand since the service also returns SPARQL text. We made a small web interface on top of the service with all logic server side. This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy. Also, the problem of syncing between a user-agent-based store and a database is entirely gone.
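As a rough illustration of how a facet path (a starting class followed by a chain of properties) can correspond to SPARQL text, here is a small Python sketch. The function `facet_path_to_sparql` and the variable naming scheme are assumptions made for this example, and the output uses standard SPARQL 1.1 aggregate syntax, which is not necessarily the text the service itself generates.

```python
from typing import List


def facet_path_to_sparql(start_class: str, properties: List[str], limit: int = 20) -> str:
    """Turn a class plus a chain of property IRIs into a counting query.

    Each navigation step adds one triple pattern, that is, one JOIN.
    The last variable in the chain is the current focus, and the query
    groups and counts its values, faceted-style.
    Hypothetical helper for illustration only.
    """
    patterns = [f"?s0 a <{start_class}> ."]
    for depth, prop in enumerate(properties):
        patterns.append(f"?s{depth} <{prop}> ?s{depth + 1} .")
    focus = f"?s{len(properties)}"
    body = "\n  ".join(patterns)
    return (
        f"SELECT {focus} (COUNT(*) AS ?count)\n"
        f"WHERE {{\n  {body}\n}}\n"
        f"GROUP BY {focus}\n"
        f"ORDER BY DESC(?count)\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    foaf = "http://xmlns.com/foaf/0.1/"
    # People -> whom they know -> what those people are interested in.
    print(facet_path_to_sparql(foaf + "Person", [foaf + "knows", foaf + "interest"]))
```

A query text produced this way is a reasonable starting point for hand refinement, since each click adds just one more line to the pattern.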
The web service takes an XML representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax. The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics.
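To give a feel for why a tree-shaped XML request is easier for a program to handle than SPARQL syntax, the following Python sketch builds such a request and models the four parts of the response listed above. The element and attribute names (`query`, `text`, `property`, `iri`) and the `FacetResponse` fields are placeholders chosen for this example; the actual format is the one defined in the Virtuoso documentation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


def build_search_xml(text: str, property_path: List[str]) -> str:
    """Build a nested XML search description (hypothetical schema).

    The full-text condition goes at the top level; each property in the
    path is nested inside the previous one, mirroring tree-wise navigation.
    """
    root = ET.Element("query")
    ET.SubElement(root, "text").text = text
    parent = root
    for iri in property_path:
        parent = ET.SubElement(parent, "property", {"iri": iri})
    return ET.tostring(root, encoding="unicode")


@dataclass
class FacetResponse:
    """The four things the post says the service returns (field names assumed)."""
    results: List[Dict[str, str]]   # matching rows with their counts
    sparql: str                     # the SPARQL query the service generated
    complete: bool                  # False if the timeout cut the counting short
    stats: Dict[str, float] = field(default_factory=dict)  # resource use


if __name__ == "__main__":
    print(build_search_xml("semantic web", ["http://example.org/author"]))
```

Because the request is a plain tree, a client can add or remove a navigation step by adding or removing one element, with no need to parse query text.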
The source of the PL functions, Web Service and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher. A Programmer's Guide will be available as part of the standard Virtuoso Documentation collection, including the Virtuoso Open Source Edition Website.
About this entry:
Author: Orri Erling
Published: 01/09/2009 22:03 GMT
01/09/2009 17:15 GMT