Introduction

We use Virtuoso 6 Cluster Edition to demonstrate the following:

  • Lookups based on text and structured information
  • Analytics queries
  • Analysis of co-occurrence of features such as interests and tags
  • Dealing with the identity of multiple IRIs (owl:sameAs)

The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.

The demo queries can also be run directly against the SPARQL endpoint.

The demo is being worked on at the time of submission and may be shown online by appointment.

Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008, the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and OpenCyc ontologies.

Examples involving owl:sameAs are also being added, as are examples using similarity metrics and search hit scores.

The Data

The database consists of the Billion Triples Challenge data sets plus some additions such as UMBEL. The Freebase extract is also newer than the original challenge data.

The triple count is 1115 million.

In the case of web-harvested resources, the data is loaded into one graph per resource.

In the case of larger data sets such as DBpedia or the US Census, all triples from the same provenance share a data-set-specific graph.

All string literals are additionally indexed in a full text index. No stop words are used.
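
Free-text indexing of object literals is controlled through Virtuoso's text index rules. A minimal sketch of turning it on for all graphs and predicates is below; the 'b3s' label is just an arbitrary rule name, not something defined by the data set.

-- index the string objects of all predicates in all graphs
DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'b3s');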

Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.

The Queries

The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist on one hand of well-known SQL features such as aggregation with grouping and existence and value subqueries, and on the other of RDF-specific features. The latter include run-time RDFS and OWL inferencing with backward chaining over subclasses, subproperties, and transitive properties.

Simple Lookups

sparql 
select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, "'semantic web'")) 
  } 
limit 10
;

This looks up triples with "semantic web" in the object and builds a search hit summary of the literal, highlighting the search terms.

sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, "plaid_skirt")) 
  } 
group by ?tp
order by desc 2
limit 40
;

This looks at what sorts of things are referenced by the properties of the person with the foaf:nick plaid_skirt.

What are these things called?

sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, "plaid_skirt")) 
  } 
group by ?lbl
order by desc 2
;

Many of these things do not have an rdfs:label. Let us use a more general concept of label which groups dc:title, foaf:name, and other name-like properties together. The subproperties are resolved at run time; there is no materialization.

sparql 
define input:inference 'b3s'
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, "plaid_skirt")) 
  } 
group by ?lbl
order by desc 2
;
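
The b3s inference context referenced above is declared outside the queries. A rough sketch of how such a context could be set up follows; the b3s: namespace and the graph IRI are illustrative placeholders, not the actual identifiers used by the demo.

sparql 
prefix b3s: <http://b3s.openlinksw.com/schemas/b3s#> 
insert into graph <http://b3s.openlinksw.com/schemas/b3s#> 
  {
    rdfs:label rdfs:subPropertyOf b3s:label .
    dc:title   rdfs:subPropertyOf b3s:label .
    foaf:name  rdfs:subPropertyOf b3s:label .
    foaf:nick  rdfs:subPropertyOf b3s:label .
  }
;

-- declare the graph as an inference context usable with define input:inference 'b3s'
rdfs_rule_set ('b3s', 'http://b3s.openlinksw.com/schemas/b3s#');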

We can list sources by the topics they contain. Below we look for graphs that mention "terrorist bombing".

sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, "'terrorist bombing'")) 
      }
  } 
group by ?g 
order by desc 2
;

Now for some Web 2.0-style tagging of search results: the tag cloud of "computer".

sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains "computer" . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;

This query will find the posters who talk the most about sex.

sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, "sex")) 
  } 
group by ?auth
order by desc 2
;

Analytics

We look for people who are joined by having relatively uncommon interests but do not know each other.

sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt > 1 && ?cnt < 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 && 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) && 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;

The query takes a fairly long time, mostly spent counting interests over the 25M foaf:interest triples. It then takes people that share an interest and checks that neither claims to know the other. Finally it sorts the results, rarest interest first. The query could be written more efficiently, but it is shown here to demonstrate that ad hoc database-wide scans of the population are possible.

Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related-tags line at the bottom of a search result page, and it showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently built in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF.

create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));

How many times is each topic mentioned?

insert into tag_count 
  select * 
    from (sparql define output:valmode "LONG" 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);

Take all pairs t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 is less than that of t2.

insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select "t1", "t2", cnt 
    from 
      (select  "t1", "t2", count (*) as cnt 
         from 
           (sparql define output:valmode "LONG"
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where "t1" < "t2" 
         group by "t1", "t2") xx
    where isiri_id ("t1") and 
          isiri_id ("t2") 
    option (quietcast); 

Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster.

update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);

Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing.

To show the URIs of the tags:

select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;
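
For the related-tags line itself, given one tag, the lookup is a single select on tag_coincidence. A sketch is below, with a hypothetical tag URI; the symmetric case, where the given tag is stored in tc_t2, is served in the same way through the tc2 index.

select top 10 id_to_iri (tc_t2), tc_count, tc_t2_count 
  from tag_coincidence 
  where tc_t1 = iri_to_id ('http://example.com/tag/semantic_web') 
  order by tc_count desc;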

Social Networks

We look at what interests people have.

sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt > 100) 
  } 
order by desc 2 
limit 100
;

Now the same for the Harry Potter fans.

sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;

We check whether foaf:knows relations are symmetrical. We return the top n people that others claim to know without the claim being reciprocated.

sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;

We look for a well-connected person to start from.

sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;

We look for the most connected of the many online identities of Stefan Decker.

sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, "'Stefan Decker'")) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;

We count the transitive closure of Stefan Decker's connections.

sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = <mailto:stefan.decker@deri.org>)
  }
;

Now we do the same while following owl:sameAs links.

sparql 
define input:same-as "yes"
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = <mailto:stefan.decker@deri.org>)
  }
;

Demo System

The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. The demonstrated system hosts these 12 servers on 2 machines, each with 2 x Xeon 5345 CPUs, 16GB of memory, and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each process ran on its own server with 16GB of RAM, the whole data set could be served from memory, which is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on the second invocation; the timing difference between the first and second run is easily an order of magnitude.