<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>

<title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</title><link>http://www.openlinksw.com:443/blog/vdb/blog/?id=1446</link><description>
Introduction 

We use Virtuoso 6 Cluster Edition to demonstrate the following:

Text and structured information based lookups
Analytics queries
Analysis of co-occurrence of features like interests and tags
Dealing with the identity of multiple IRIs (owl:sameAs)


The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.
The demo queries can also be run directly against the SPARQL end point.

The demo is being worked on at the time of submission and may be shown online by appointment.

Automatic annotation of the data based on named entity extraction is
being worked on at the time of this submission.  By the time of ISWC
2008, the set of sample queries will be enhanced with queries based on
extracted named entities and their relationships in the UMBEL and
OpenCyc ontologies.


Examples involving owl:sameAs are also being added, as are examples using similarity metrics and search hit scores.

The Data

The database consists of the billion triples data sets and some additions like UMBEL.  The Freebase extract is also newer than the one in the challenge original.
The triple count is 1115 million.
In the case of web harvested resources, the data is loaded in one graph per resource.
In the case of larger data sets like DBpedia or the US census, all triples from the same source share a data-set-specific graph.
All string literals are additionally indexed in a full text index.  No stop words are used.

Most queries do not specify a graph.  Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS.  All indices ending in S are bitmap indices.


The Queries 


The demo uses Virtuoso SPARQL extensions in most queries.  These
extensions consist, on the one hand, of well-known SQL features like
aggregation with grouping and existence and value subqueries and, on
the other, of RDF-specific features.
The latter include run-time RDFS and OWL inferencing support, backward
chaining of subclasses, and transitivity.



Simple Lookups 

sparql 
select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, &amp;quot;&#39;semantic web&#39;&amp;quot;)) 
  } 
limit 10
;


This looks up triples with the phrase &#39;semantic web&#39; in the object and makes a search hit summary of the literal, 
highlighting the search terms.


sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) 
  } 
group by ?tp
order by desc 2
limit 40
;


This looks at what sorts of things are referenced by the properties of the FOAF handle plaid_skirt.
What are these things called?

sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) 
  } 
group by ?lbl
order by desc 2
;


Many of these things do not have an rdfs:label.  Let us use a more general concept of label 
which groups dc:title, foaf:name and other name-like properties together.  The subproperties are 
resolved at run time; there is no materialization.


sparql 
define input:inference &#39;b3s&#39;
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) 
  } 
group by ?lbl
order by desc 2
;
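
As a rough illustration of what resolving subproperties at run time means, here is a minimal in-memory Python sketch. It is not how Virtuoso implements inference. The sample triples and the subproperty map are hypothetical, and the assumption that rdfs:label itself sits under b3s:label is ours, not something stated by the rule set.

```python
# Hypothetical miniature data; the IRIs are placeholders, not challenge data.
TRIPLES = [
    ("ex:doc1", "dc:title", "Weekly notes"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:thing", "rdfs:label", "A thing"),
    ("ex:alice", "foaf:nick", "al"),
]

# Assumed rule set: each property maps to its direct super-property.
SUBPROPERTY_OF = {
    "dc:title": "b3s:label",
    "foaf:name": "b3s:label",
    "rdfs:label": "b3s:label",
}


def expand_property(prop):
    """Return prop plus every property declared, directly or not, beneath it."""
    found = {prop}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBPROPERTY_OF.items():
            if sup in found and sub not in found:
                found.add(sub)
                changed = True
    return found


def match(prop):
    """Match the pattern ?s prop ?o, expanding subproperties while scanning."""
    wanted = expand_property(prop)
    return [(s, o) for (s, p, o) in TRIPLES if p in wanted]


labels = match("b3s:label")
```

No b3s:label triple is ever stored in this sketch; the stored triples keep their original predicates and the grouping happens only when the pattern is evaluated.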


We can list sources by the topics they contain.  
Below we look for graphs that mention terrorist bombing.


sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, &amp;quot;&#39;terrorist bombing&#39;&amp;quot;)) 
      }
  } 
group by ?g 
order by desc 2
;


Now some Web 2.0 tagging of search results.  The tag cloud of &amp;quot;computer&amp;quot;:

sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains &amp;quot;computer&amp;quot; . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;


This query will find the posters who talk the most about sex.

sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, &amp;quot;sex&amp;quot;)) 
  } 
group by ?auth
order by desc 2
;


Analytics 

We look for people who are joined by having relatively uncommon interests but do not know each other.

sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt &amp;gt; 1 &amp;amp;&amp;amp; ?cnt &amp;lt; 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 &amp;amp;&amp;amp; 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;amp;&amp;amp; 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;


The query takes a fairly long time, mostly spent counting the interested people across 25M interest triples.  
It then takes people that share the interest and checks that neither claims to know the other.  
It then sorts the results, rarest interest first.  The query can be written more efficiently but is 
here just to show that database-wide scans of the population are possible ad hoc.
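
The steps of this query can be sketched in plain Python over a tiny in-memory data set. This is only an illustration of the logic under stated assumptions: the people, interests, and knows pairs are hypothetical; the sketch counts distinct people per interest where the query counts triples; and it emits each unordered pair once where the query returns both orderings.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data standing in for foaf:interest and foaf:knows.
INTERESTS = [
    ("alice", "kite-aerial-photography"),
    ("bob", "kite-aerial-photography"),
    ("carol", "kite-aerial-photography"),
    ("alice", "music"),
]
KNOWS = {("alice", "bob")}


def strangers_with_shared_interests(interests, knows, low=1, high=10):
    # Step 1: group people by interest (the counting subquery).
    people_by_interest = defaultdict(set)
    for person, interest in interests:
        people_by_interest[interest].add(person)

    rows = []
    for interest, people in people_by_interest.items():
        cnt = len(people)
        # Step 2: keep only interests whose count lies strictly between low and high.
        if cnt not in range(low + 1, high):
            continue
        # Step 3: pair up the people and drop pairs where either claims to know the other.
        for p1, p2 in combinations(sorted(people), 2):
            if (p1, p2) in knows or (p2, p1) in knows:
                continue
            rows.append((cnt, interest, p1, p2))
    # Step 4: rarest interest first.
    rows.sort()
    return rows


result = strangers_with_shared_interests(INTERESTS, KNOWS)
```

Step 1 touches every interest triple, which matches the observation that the counting dominates the running time.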


Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style
related tags line at the bottom of a search result page.  This showcases the use of SQL together 
with SPARQL.  The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is 
much more efficiently done in SQL, especially since it gets updated as the data changes.  
This is an example of materialized intermediate results based on warehoused RDF.


create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));


How many times is each topic mentioned?


insert into tag_count 
  select * 
    from (sparql define output:valmode &amp;quot;LONG&amp;quot; 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);


Take all pairs t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 is less than that of t2.

insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;, cnt 
    from 
      (select  &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;, count (*) as cnt 
         from 
           (sparql define output:valmode &amp;quot;LONG&amp;quot;
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where &amp;quot;t1&amp;quot; &amp;lt; &amp;quot;t2&amp;quot; 
         group by &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;) xx
    where isiri_id (&amp;quot;t1&amp;quot;) and 
          isiri_id (&amp;quot;t2&amp;quot;) 
    option (quietcast); 
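
The half-matrix idea can be shown with a short Python sketch. It is a toy version under stated assumptions: the subject and tag names are hypothetical, and tags are ordered as strings here, whereas the SQL above compares internal IRI ids.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (subject, tag) pairs standing in for ?s sioc:topic ?t.
TAGGINGS = [
    ("post1", "rdf"), ("post1", "sparql"), ("post1", "web"),
    ("post2", "rdf"), ("post2", "sparql"),
    ("post3", "web"),
]


def count_tags(taggings):
    """Occurrences per tag, like the tag_count table."""
    return Counter(tag for _subject, tag in taggings)


def count_cooccurrences(taggings):
    """Co-occurrences per ordered pair, like the tag_coincidence table."""
    tags_by_subject = {}
    for subject, tag in taggings:
        tags_by_subject.setdefault(subject, set()).add(tag)

    pairs = Counter()
    for tags in tags_by_subject.values():
        # Combinations over a sorted list yield each pair exactly once,
        # smaller tag first, so only one permutation is ever stored.
        pairs.update(combinations(sorted(tags), 2))
    return pairs


tag_count = count_tags(TAGGINGS)
tag_coincidence = count_cooccurrences(TAGGINGS)
```

Storing one permutation halves the table; the cost is that a lookup for a given tag has to consider it in either position.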


Now put the individual occurrence counts into the same table with the co-occurrence.  This 
denormalization makes the related tags lookup faster.



update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);


Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.  
A single select will return a Technorati-style related tags listing.


To show the URIs of the tags:


select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;
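
For a related tags line, the lookup is per tag rather than global. The following Python sketch shows one way such a lookup could work over rows shaped like tag_coincidence. The rows and the function name are hypothetical. Because only one permutation of each pair is stored, the tag has to be looked for in both columns, which is presumably what the second index on (tc_t2, tc_t1) is for.

```python
# Hypothetical rows shaped like tag_coincidence:
# (tc_t1, tc_t2, tc_count, tc_t1_count, tc_t2_count), with tc_t1 ordered before tc_t2.
ROWS = [
    ("rdf", "sparql", 2, 2, 2),
    ("rdf", "web", 1, 2, 2),
    ("sparql", "web", 1, 2, 2),
]


def related_tags(rows, tag, limit=10):
    """Return (other_tag, joint_count, other_tag_count), most frequent first."""
    hits = []
    for t1, t2, together, t1_count, t2_count in rows:
        if t1 == tag:
            hits.append((t2, together, t2_count))
        elif t2 == tag:
            hits.append((t1, together, t1_count))
    # Highest joint count first; ties broken by tag name for a stable order.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits[:limit]


related_to_rdf = related_tags(ROWS, "rdf")
related_to_web = related_tags(ROWS, "web")
```

Having the individual counts on the same row means the ranking could also be normalized, for example by dividing the joint count by the count of the other tag, without a second lookup.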


Social Networks 

We look at what interests people have.

sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt &amp;gt; 100) 
  } 
order by desc 2 
limit 100
;


Now the same for the Harry Potter fans 

sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest &amp;lt;http://www.livejournal.com/interests.bml?int=harry+potter&amp;gt; .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;


We check whether foaf:knows relations are symmetrical.  We return the top n people that others claim to know without being reciprocally known.

sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;


We look for a well connected person to start from.

sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;


We look for the most connected of the many online identities of Stefan Decker.

sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, &amp;quot;&#39;Stefan Decker&#39;&amp;quot;)) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;


We count the transitive closure of Stefan Decker&#39;s connections.

sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &amp;lt;mailto:stefan.decker@deri.org&amp;gt;)
  }
;
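
What the transitive option computes can be pictured as a breadth-first walk over foaf:knows edges from a fixed start. The Python sketch below is only an illustration: the graph is hypothetical, and the choice to count the start node only when a cycle leads back to it is an assumption of this sketch, not a statement about Virtuoso.

```python
from collections import deque

# Hypothetical foaf:knows edges: person to the people they claim to know.
KNOWS = {
    "stefan": ["anna", "ben"],
    "anna": ["ben", "chris"],
    "chris": ["stefan"],
    "dora": ["stefan"],
}


def reachable(start, edges):
    """Everyone reachable from start by following knows edges one or more times."""
    seen = set()
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for other in edges.get(person, []):
            # Each person is recorded once, which mirrors asking for distinct results.
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


closure = reachable("stefan", KNOWS)
closure_size = len(closure)
```

Edges are followed only in the direction of the claim, so dora, who claims to know stefan but is known by nobody, is not part of the result.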


Now we do the same while following owl:sameAs links.

sparql 
define input:same-as &amp;quot;yes&amp;quot;
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &amp;lt;mailto:stefan.decker@deri.org&amp;gt;)
  }
;


Demo System 

The system runs on Virtuoso 6 Cluster Edition.  The database is partitioned into 12 partitions, 
each served by a distinct server process. The demonstrated system hosts these 12 servers on 2 
machines, each with 2 x Xeon 5345, 16GB of memory, and 4 SATA disks. For scaling, the processes 
and corresponding partitions can be spread over a larger number of machines.  If each process ran on its 
own machine with 16GB of RAM, the whole data set could be served from memory. This is desirable for 
search engine or fast analytics applications. Most of the demonstrated queries run in memory on 
the second invocation. The timing difference between the first and second run is easily an order of 
magnitude.

</description><pubDate>Tue, 30 Sep 2008 16:24:34 GMT</pubDate><generator>Virtuoso Universal Server 08.03.3334</generator><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot</dc:creator><image><title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</title><url>http://www.openlinksw.com:443/weblog/public/images/vbloglogo.gif</url><link>http://www.openlinksw.com:443/blog/vdb/blog/?id=1446</link><description>A great place to track Virtuoso&#39;s rapid evolution.</description><width>88</width><height>31</height></image>

</channel>
</rss>
