Virtuoso Open-Source Wiki
Virtuoso Open-Source, OpenLink Data Spaces, and OpenLink Ajax Toolkit

What is the best method to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint?

The best method to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint is decimation in its original sense of picking one in every ten:

SELECT ?s ?p ?o 
FROM <some-graph>
WHERE 
  { 
    ?s ?p ?o .
    FILTER ( 1 >  <bif:rnd> (10, ?s, ?p, ?o) )
  }

By changing the first argument of bif:rnd() and the constant on the left side of the inequality, you can adjust the decimation ratio from 1/10 to any desired value. Note that the SQL optimizer is free to evaluate bif:rnd (10) only once, at the beginning of the query, because its result does not depend on the current row. For that reason, we pass three additional arguments whose values are known only when a table row is fetched. As a result, bif:rnd (10, ?s, ?p, ?o) is calculated for each and every row, and each row is returned or ignored independently of the others.
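
As an illustration of adjusting the ratio, the following sketch keeps roughly 3 of every 100 triples. It assumes that bif:rnd (N, ...) returns an integer from 0 to N-1, so that the filter passes a row with probability 3/100. The graph name is the same placeholder as above, and the values 3 and 100 are chosen only for the example.

```sparql
# Keep a row when the per-row random integer (0..99) is 0, 1, or 2.
SELECT ?s ?p ?o
FROM <some-graph>
WHERE
  {
    ?s ?p ?o .
    FILTER ( 3 > <bif:rnd> (100, ?s, ?p, ?o) )
  }
```

More generally, under the same assumption, FILTER ( K > <bif:rnd> (N, ?s, ?p, ?o) ) should keep about K out of every N triples.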

However, bif:rnd (10, ?s, ?p, ?o) contains a subtle inefficiency. In the RDF store, graph nodes are stored as numeric IRI IDs, and literal objects may be stored in a separate table. A SQL function call needs arguments of traditional SQL datatypes, so the query processor will extract the text of the IRI for each node and the full value for each literal object. That is a significant waste of time. The workaround is to tell the SPARQL front-end to omit these redundant conversions of values by using the SHORT_OR_LONG tag, as shown here:

SELECT ?s ?p ?o 
FROM <some-graph> 
WHERE 
  { 
    ?s ?p ?o .
    FILTER ( 1 >  <SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o))  
  }

Live Example

The following SPARQL query returns up to 100 randomly sampled occurrences of dc:description on the LOD Cloud Cache instance:

SELECT * 
WHERE 
  {
    ?s <http://purl.org/dc/elements/1.1/description> ?o .
    FILTER ( 1 >  <SHORT_OR_LONG::bif:rnd> (10, ?s, ?o) )
  }
LIMIT 100

