Semantic web technology has, on one hand, SPARQL and, on the other, various visualization tools such as faceted browsers like Longwell, Facet, and others.

But these two sides do not meet very well, because SPARQL does not support aggregation and grouping, which are the very basis of faceted browsing.

So we have looked at possible solutions. The first and most obvious is to add SQL-style aggregation to SPARQL. We can do this already, since one can embed SPARQL into SQL, but a SPARQL-only syntax would be nicer: it would avoid the need for a SQL client login and would let the query go over the SPARQL protocol.
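
As a rough illustration of the SQL embedding route, the following sketch assumes Virtuoso-style SPARQL-inside-SQL; the exact embedding and column naming are Virtuoso-specific, and the DBpedia property URI is only for illustration:

    -- SQL aggregation over a SPARQL derived table: count people per birthplace.
    -- Columns of the derived table are assumed to be named after the SPARQL variables.
    SELECT t.birthPlace, COUNT (*)
      FROM (SPARQL
              SELECT ?person ?birthPlace
              WHERE { ?person <http://dbpedia.org/property/birthPlace> ?birthPlace }
           ) AS t
     GROUP BY t.birthPlace;

This works, but it requires a SQL session and cannot go over the SPARQL protocol, which is exactly the limitation mentioned above.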

The first part of the solution is to allow aggregates in a SPARQL select. The aggregates are directly inherited from SQL, meaning count, min, max, sum, avg, with an optional distinct modifier. All terms of a select that are not aggregates are considered grouping columns, so one gets an implicit group by when combining variables and aggregates in a single select.
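
As a sketch of the intended syntax (the DBpedia property URI is again only illustrative), counting people per birthplace would look like this; ?birthPlace is not an aggregate, so it becomes an implicit grouping column:

    # Count people per birthplace; the non-aggregate ?birthPlace
    # acts as an implicit group by column.
    SELECT ?birthPlace COUNT (?person)
    WHERE
      {
        ?person <http://dbpedia.org/property/birthPlace> ?birthPlace
      }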

This is straightforward and is already a big improvement. But for very large result sets in a browsing situation, we can do something better.

Let's think about DBpedia as an example. We have data about people: names, birth and death dates, birthplaces, text descriptions. Some drill-down could be supported with a traditional OLAP cube. But in the semantic web situation, the number of dimensions can be high and the dimensions are usually sparse. Besides, the number of dimensions is liable to change. Precomputed indices of everything are not the best choice here, even though they are fine for well-defined analytics.

For an overview of the data, we can precompute some things: for example, the set of distinct graphs and their respective triple counts, and the same for each graph/predicate combination. These do not have to be absolutely up to date, and they provide a quick first level of directory. The count of instances of each class is also a candidate for precomputing, as is the count of triples grouped by class of S and graph.
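
The statistics meant here correspond to queries along these lines, using the aggregate syntax sketched above; their results would be cached and refreshed lazily rather than computed per request:

    # Triple counts per graph/predicate combination;
    # summing over ?p gives per-graph totals.
    SELECT ?g ?p COUNT (*)
    WHERE { GRAPH ?g { ?s ?p ?o } }

    # Instance counts per class.
    SELECT ?class COUNT (?s)
    WHERE { ?s a ?class }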

Once we start talking about specific Ss or non-class Os, it should be possible either to count the triples exactly or to count them up to a maximum. A count aggregate that stops the query once the count reaches a given ceiling (maxc) might be useful. This would give small counts precisely, but for larger cardinalities it would simply report that the count exceeds the limit. In some cases this can be extrapolated to an actual count with fair precision, but not always.
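
A purely hypothetical sketch of what such an aggregate could look like; the name COUNT_MAX and its ceiling argument are made up here to show the intent, not an existing or agreed syntax:

    # Hypothetical: stop counting once 1000 matching subjects are seen.
    # Small results come back exact; otherwise the answer is "more than 1000".
    SELECT COUNT_MAX (1000, ?s)
    WHERE
      {
        ?s <http://dbpedia.org/property/birthPlace> <http://dbpedia.org/resource/Paris>
      }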

The more factors are given, the faster it is to count. For example, if we have a query that looks for an S with P1 = O1 and P2 = O2, this is a merge intersection which is very quick, especially if the intersection is a small fraction of the number of rows.
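
In query terms, the shape in question is just a count over two patterns sharing the same subject; the ex: predicates and objects below are placeholders:

    # Subjects having both P1 = O1 and P2 = O2; evaluated as a
    # merge intersection of the two property lookups.
    PREFIX ex: <http://example.org/>
    SELECT COUNT (?s)
    WHERE
      {
        ?s ex:p1 ex:o1 .
        ?s ex:p2 ex:o2
      }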

We can also look at the use of full-text conditions with browsing. Precomputed counts become pretty much useless once a text condition is in the mix, since a text condition is not something that can be mapped onto a dimension of a cube. In this situation, it seems appropriate to count up to a certain maximum and then stop. Finding the first few matches for an AND of text and equalities on RDF graph, objects, or subjects is always quick. Depending on the case, one may even scale the count upwards into an estimate by looking at how far the outermost loop of the joins had gotten before reaching the count ceiling. There is a little more to this, since not all joins are nested loops, but this is the general idea.
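
For instance, with Virtuoso's bif:contains text predicate (used here only as an example of a text condition, and reusing the hypothetical ceiling-limited count from above), such a mixed query might look like:

    # Count, up to a ceiling, people born in a given place whose description
    # matches a text pattern; the text condition rules out precomputed counts.
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT COUNT_MAX (1000, ?s)
    WHERE
      {
        ?s <http://dbpedia.org/property/birthPlace> <http://dbpedia.org/resource/Paris> .
        ?s rdfs:comment ?text .
        ?text bif:contains "philosopher"
      }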

Relational DBMSs often use random sampling of tables for optimization statistics. With RDF, this does not seem to work so well; or rather, the usual types of statistics are quite useless if all triples are in the same table. This is why we take actual samples at optimize time whenever leading key parts have literal values. Further, we take a small random sample of all triples and remember the distinct P and G values, as these are of low cardinality with a highly uneven distribution. This gives us a fairly reliable idea of the relative frequencies of the more common Gs and Ps. This is good for query optimization but not necessarily for browsing: good enough for sorting by frequency, but not quite good enough for showing counts.

Anyway, we will start by introducing the basic SQL aggregates and then add a way of limiting how much gets counted. This will be precise for small counts and give an order of magnitude for larger ones.