Exploiting Virtuoso's Anytime Query Feature
What?
Virtuoso has a native mechanism for providing query results within a configurable time frame.
Why?
This feature is called "Anytime Query", and it is vital to any practical DBMS-related offering (especially on the World Wide Web) that supports query access from an unpredictable number of clients performing a variety of queries.
How?
This functionality is available via the default HTML-based SPARQL Query Editor page that accompanies every Virtuoso instance. It is also available via a parameter (&timeout={time in milliseconds}) in Virtuoso's SPARQL Protocol implementation.
Sample Scenario
The following scenario demonstrates how to manage anytime SPARQL query execution using the timeout parameter at a Virtuoso SPARQL Endpoint.
- Suppose a simple query:
SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100
- Suppose the query fails against a Virtuoso endpoint because it exceeds the execution time limit set in the virtuoso.ini configuration file. The query is essentially a table scan: it has to go through every quad and build a hash of the unique P values it sees. Only once that is done does it return the first 100 values of P from this hash.
- Next, we look at the SQL translation of the query and at its explain plan:
SQL> SET sparql_translate ON;
SQL> SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100;
SPARQL_TO_SQL_TEXT
VARCHAR
_______________________________________________________________________________

SELECT __ro2sq ("s_2_0_rbc"."p") AS "p" FROM (SELECT DISTINCT TOP 100 __id2i ( "s_2_1-t0"."P" ) AS "p"
FROM DB.DBA.RDF_QUAD AS "s_2_1-t0"
OPTION (QUIETCAST)) AS "s_2_0_rbc"

1 Rows. -- 1 msec.
SQL> set sparql_translate off;
SQL> set explain on;
SQL> SELECT __ro2sq ("s_2_0_rbc"."p") AS "p" FROM (SELECT DISTINCT TOP 100 __id2i ( "s_2_1-t0"."P" ) AS "p"
FROM DB.DBA.RDF_QUAD AS "s_2_1-t0"
OPTION (QUIETCAST)) AS "s_2_0_rbc" ;
REPORT
VARCHAR
_______________________________________________________________________________

{
  local save: ($22 "set_no", $23 "set_no_save")
Subquery 20
{
from DB.DBA.RDF_QUAD by RDF_QUAD_POGS 8.9e+07 rows
Key RDF_QUAD_POGS ASC ($25 "s_2_1-t0.P")

After code:
      0: $28 "__id2i" := Call __id2i ($25 "s_2_1-t0.P")
      5: BReturn 0
Distinct (HASH) ($25 "s_2_1-t0.P")

After code:
      0: $21 "p" := := artm $28 "__id2i"
      4: BReturn 0
Subquery Select(TOP 100 ) ($21 "p", <$27 "<DB.DBA.RDF_QUAD s_2_1-t0>" spec 5>)
}

After code:
      0: $44 "p" := Call __ro2sq ($21 "p")
      5: BReturn 0
Select ($44 "p")
}

26 Rows. -- 1 msec.
SQL> set explain off;
- As we can see, the query uses the RDF_QUAD_POGS index, which offers the best basis for constructing the query plan. It would be marginally faster if there were an index that started with GP, but the current 2+3 index scheme does not have one.
- So it is almost a full table scan, which might not complete on most of our systems because of the limits set in the INI file for Web-accessible instances.
- Set the Execution timeout on the /sparql form to enable the Anytime paradigm:
  - Go to a SPARQL Endpoint, for example the DBpedia SPARQL Endpoint.
  - Enter the query:
    SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100
  - Enter for "Execution timeout": 60000
  - Click "Run Query".
  - As a result, you should be redirected to a URL like the following, which contains the parameter timeout=60000 and presents the results found:
http://dbpedia.org/sparql/?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fp%0D%0AWHERE%0D%0A{%0D%0A+%3Fs+%3Fp+%3Fo%0D%0A}%0D%0ALIMIT+100%0D%0A&format=text%2Fhtml&timeout=60000&debug=on
- Conclusions: With a timeout value of 10000 (10 seconds) you get only 1 unique ?p value back. With a timeout value of 60000 (60 seconds) you get only about 6 different values for ?p. This is logical considering that the index is sorted by predicate (the P slot in the RDF triple pattern): once the scan finds the first unique ?p, it has to skip all the triples that share that value of ?p before it reaches the next one.
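The reasoning in the conclusions can be illustrated with a small simulation. The sketch below is not Virtuoso's implementation: the function name, the toy index, and the use of a row budget as a stand-in for the wall-clock timeout are all assumptions made for illustration. It walks a list of quads ordered by predicate, keeps a hash of the predicates seen so far, and returns whatever it has collected when the budget runs out.

```python
def anytime_distinct_predicates(quads_sorted_by_p, row_budget, limit=100):
    """Scan quads in predicate order and return the distinct predicates
    found before the budget is spent (hypothetical helper, toy model only)."""
    seen_in_order = []   # distinct predicates in first-seen order
    seen_hash = set()    # plays the role of the Distinct (HASH) step
    for scanned, (p, _o, _g, _s) in enumerate(quads_sorted_by_p):
        # Stop when the budget is used up or enough values were collected.
        if scanned >= row_budget or len(seen_in_order) >= limit:
            break
        if p not in seen_hash:
            seen_hash.add(p)
            seen_in_order.append(p)
    return seen_in_order


# Toy index in (P, O, G, S) order: two very common predicates, one rare one.
toy_index = (
    [("rdf:type", i, "g", i) for i in range(1000)]
    + [("rdfs:label", i, "g", i) for i in range(1000)]
    + [("foaf:name", i, "g", i) for i in range(10)]
)

print(anytime_distinct_predicates(toy_index, row_budget=500))
# ['rdf:type']
print(anytime_distinct_predicates(toy_index, row_budget=1500))
# ['rdf:type', 'rdfs:label']
```

A small budget is consumed entirely by rows that share the first predicate, which mirrors why a short timeout yields only one or a few distinct ?p values.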
Using cURL
cURL Variant without the timeout parameter
$ curl -F "query=SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100" http://dbpedia.org/sparql
Virtuoso S1T00 Error SR171: Transaction timed out
SPARQL query:
define sql:big-data-const 0 SELECT DISTINCT ?p WHERE { ?s ?p ?o} LIMIT 100
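A client script could recognize this failure and build a request that carries the timeout parameter instead. The following is a minimal Python sketch under stated assumptions: the helper names build_sparql_url and looks_like_timeout are hypothetical, and the error check simply looks for the S1T00 / SR171 codes shown in the output above rather than relying on any documented error format. Only the query, format, default-graph-uri, and timeout parameters that appear in this document are used.

```python
from urllib.parse import urlencode


def build_sparql_url(endpoint, query, timeout_ms=None,
                     default_graph=None, fmt="text/html"):
    """Build a GET URL for a SPARQL endpoint (hypothetical helper).

    timeout_ms, when given, is sent as the 'timeout' parameter in
    milliseconds, as described for Virtuoso's SPARQL Protocol support.
    """
    params = []
    if default_graph:
        params.append(("default-graph-uri", default_graph))
    params.append(("query", query))
    params.append(("format", fmt))
    if timeout_ms is not None:
        params.append(("timeout", str(int(timeout_ms))))
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return endpoint + "?" + urlencode(params)


def looks_like_timeout(body):
    """Heuristic check for the timeout error text shown above (assumption)."""
    return "S1T00" in body or "SR171" in body


query = "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100"
url = build_sparql_url("http://dbpedia.org/sparql", query,
                       timeout_ms=60000,
                       default_graph="http://dbpedia.org")
print(url)
print(looks_like_timeout("Virtuoso S1T00 Error SR171: Transaction timed out"))
```

The resulting URL can be passed to curl or to any HTTP client; no request is sent by the sketch itself.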
cURL Variant using the timeout parameter
$ curl "http://dbpedia.org/sparql/?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+DISTINCT+%3Fp%0D%0AWHERE%0D%0A%7B%0D%0A+%3Fs+%3Fp+%3Fo%0D%0A%7D%0D%0ALIMIT+100%0D%0A&format=text%2Fhtml&timeout=60000&debug=on"
<table class="sparql" border="1">
<tr>
<th>p</th>
</tr>
<tr>
<td>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</td>
</tr>
<tr>
<td>http://www.w3.org/2002/07/owl#equivalentClass</td>
</tr>
<tr>
<td>http://www.w3.org/2002/07/owl#sameAs</td>
</tr>
<tr>
<td>http://www.w3.org/2002/07/owl#equivalentProperty</td>
</tr>
<tr>
<td>http://www.w3.org/2000/01/rdf-schema#subClassOf</td>
</tr>
<tr>
<td>http://www.w3.org/2004/02/skos/core#broader</td>
</tr>
<tr>
<td>http://www.w3.org/2000/01/rdf-schema#comment</td>
</tr>
<tr>
<td>http://www.w3.org/2000/01/rdf-schema#label</td>
</tr>
<tr>
<td>http://xmlns.com/foaf/0.1/name</td>
</tr>
<tr>
<td>http://xmlns.com/foaf/0.1/nick</td>
</tr>
<tr>
<td>http://www.w3.org/2004/02/skos/core#prefLabel</td>
</tr>
<tr>
<td>http://www.w3.org/2003/01/geo/wgs84_pos#lat</td>
</tr>
<tr>
<td>http://www.w3.org/2003/01/geo/wgs84_pos#long</td>
</tr>
<tr>
<td>http://www.w3.org/2000/01/rdf-schema#domain</td>
</tr>
<tr>
<td>http://www.w3.org/2000/01/rdf-schema#range</td>
</tr>
<tr>
<td>http://www.w3.org/2002/07/owl#versionInfo</td>
</tr>
<tr>
<td>http://dbpedia.org/ontology/purpose</td>
</tr>
<tr>
<td>http://dbpedia.org/ontology/supplementalDraftRound</td>
</tr>
<tr>
<td>http://dbpedia.org/ontology/podiums</td>
</tr>
<tr>
<td>http://dbpedia.org/ontology/buildingStartDate</td>
</tr>
</table>
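When the text/html format is requested, as above, the partial result arrives as an HTML table. If a script needs the predicate IRIs as a list, the table cells can be pulled out with Python's standard html.parser module. This is only a sketch: the class and function names are hypothetical, and it assumes the response has the simple one-column shape shown above.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")  # start a new cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_td:
            self.cells[-1] += data


def extract_cells(html_text):
    """Return the stripped text of each <td> cell in document order."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    return [cell.strip() for cell in parser.cells]


sample = """<table class="sparql" border="1">
<tr>
<th>p</th>
</tr>
<tr>
<td>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</td>
</tr>
<tr>
<td>http://www.w3.org/2002/07/owl#sameAs</td>
</tr>
</table>"""

for predicate in extract_cells(sample):
    print(predicate)
```

The header cell is skipped because it uses th rather than td.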