A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL endpoints, should it gain serious mass and traffic.

As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think about what we will do if, for example, somebody decides to count all the triples in the set.

How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?

Restricting the query language is not the answer; any language more expressive than plain text search can state queries that take forever to execute. Nor is simply returning a timeout error after the first second (or whatever arbitrary period): that leaves people in the dark and does nothing for the impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, to return partial results along with an indication of how much processing was done.

Here we look for the top 10 people whom others claim to know without being known in return, like this:

SQL> sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;
celeb                                     callret-1
VARCHAR                                   VARCHAR
________________________________________  _________

http://twitter.com/BarackObama            252
http://twitter.com/brianshaler            183
http://twitter.com/newmediajim            101
http://twitter.com/HenryRollins           95
http://twitter.com/wilw                   81
http://twitter.com/stevegarfield          78
http://twitter.com/cote                   66
mailto:adam.westerski@deri.org            66
mailto:michal.zaremba@deri.org            66
http://twitter.com/dsifry                 65
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages
SQL> sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10;
celeb                                     callret-1
VARCHAR                                   VARCHAR
________________________________________  _________

http://twitter.com/JasonCalacanis         496
http://twitter.com/Twitterrific           466
http://twitter.com/ev                     442
http://twitter.com/BarackObama            356
http://twitter.com/laughingsquid          317
http://twitter.com/gruber                 294
http://twitter.com/chrispirillo           259
http://twitter.com/ambermacarthur         224
http://twitter.com/t                      219
http://twitter.com/johnedwards            188
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages
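
For a programmatic client, the condition is easy to detect: the partial rows arrive as an ordinary result set, followed by SQLSTATE S1TAT. Below is a minimal sketch using pyodbc against a Virtuoso ODBC DSN; the DSN, the credentials, and the exact point at which the driver surfaces the state (assumed here to be during the fetch loop) are illustrative assumptions, not a statement of driver behavior.

import pyodbc

# Illustrative DSN and credentials; adjust to your own Virtuoso data source.
conn = pyodbc.connect("DSN=Local Virtuoso;UID=dba;PWD=dba")
cur = conn.cursor()

query = """SPARQL
SELECT ?celeb, COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1)
                              WHERE { ?celeb foaf:knows ?claimant } ) ) }
GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10"""

rows, partial = [], False
try:
    cur.execute(query)
    while True:                    # fetch row by row so that rows already
        row = cur.fetchone()       # received survive a mid-stream timeout
        if row is None:
            break
        rows.append(row)
except pyodbc.Error as e:
    # S1TAT is the SQLSTATE shown above for the anytime timeout;
    # whatever was fetched so far is a valid partial answer.
    if e.args and e.args[0] == "S1TAT":
        partial = True
    else:
        raise

for celeb, cnt in rows:
    print(celeb, cnt)
if partial:
    print("-- partial results: the result timeout was hit")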

The first query read its data from disk; the second run had the working set left behind by the first and could get through more data before time ran out, hence the more complete counts. The response time, however, was the same.

If a query just loops over consecutive joins, as in basic SPARQL, interrupting the processing after a set time is simple. But such queries are not very interesting. Giving meaningful partial answers in the presence of nested aggregation and sub-queries requires a few more tricks. The basic idea is to terminate the innermost active sub-query or aggregation at the first timeout, and to extend the timeout a little so that the accumulated results get fed to the next aggregation, for example from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered whenever any results were found for which the query pattern holds. False results are not produced, except where the query compares against a count and that count is smaller than it would have been with full evaluation.
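
To make the cascade concrete, here is a deliberately simplified sketch of the idea, not of Virtuoso internals: the GROUP BY accumulator stops when its deadline passes, whatever groups it has collected are handed on, and the ORDER BY / LIMIT stage gets a small extension of the deadline so that it can still produce ordered output. All names and numbers are illustrative.

import time
from collections import defaultdict

GRACE = 0.1  # extra seconds granted to outer stages after an inner timeout

def group_by_count(pairs, deadline):
    """Innermost stage: count claimants per celebrity until the deadline.
    On timeout, return the groups accumulated so far instead of discarding them."""
    counts = defaultdict(int)
    for claimant, celeb in pairs:
        if time.monotonic() > deadline:
            return counts, True
        counts[celeb] += 1
    return counts, False

def order_by_limit(counts, deadline, limit=10):
    """Outer stage: sort the (possibly partial) groups, again under a deadline."""
    if time.monotonic() > deadline:
        # Even the extended deadline is gone; hand back unsorted groups.
        return list(counts.items())[:limit]
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]

def anytime_query(pairs, budget_seconds):
    deadline = time.monotonic() + budget_seconds
    counts, timed_out = group_by_count(pairs, deadline)
    if timed_out:
        # Extend the deadline a little so the accumulated groups still
        # flow from the GROUP BY into the ORDER BY.
        deadline += GRACE
    return order_by_limit(counts, deadline), timed_out

# Toy data: everyone claims to know one account, which knows nobody back.
pairs = [("user%d" % i, "http://twitter.com/BarackObama") for i in range(1_000_000)]
top, partial = anytime_query(pairs, budget_seconds=0.05)
print(top, "(partial)" if partial else "")

Note that undercounting the group totals, as the sketch does when it stops early, is exactly the case mentioned above where a count can come out smaller than under full evaluation.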

One can also use this as a basis for paid services. The cutoff does not have to be expressed in time; it can be in other units, such as the amount of data touched, which makes it insensitive to concurrent usage and to variations in the working set.
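
A cutoff in work units rather than in seconds is a small variation of the previous sketch: budget the rows touched (the "R rnd" / "R seq" figures in the Activity lines count such rows) instead of watching the clock. The function and budget below are illustrative.

from collections import defaultdict

def group_by_count_budgeted(pairs, row_budget):
    """Innermost stage again, but cut off after touching row_budget rows
    instead of after a wall-clock deadline."""
    counts = defaultdict(int)
    for touched, (claimant, celeb) in enumerate(pairs):
        if touched >= row_budget:
            return counts, True   # budget exhausted: hand back partial groups
        counts[celeb] += 1
    return counts, False

Because the budget is consumed deterministically, the same query with the same budget does the same amount of work whether the box is idle or loaded and whether the pages are cached or on disk, which is what makes it usable as a metered unit for a paid tier.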

This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature if desired.) The feature works with Virtuoso 6 in both single-server and cluster deployments.