Introduction
We use Virtuoso 6 Cluster Edition to demonstrate the following:
- Text and structured information based lookups
- Analytics queries
- Analysis of co-occurrence of features like interests and tags
- Dealing with identity of multiple IRIs (owl:sameAs)
The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension.
The demo queries can also be run directly against the SPARQL endpoint.
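As a rough sketch of what a direct request could look like, the Python below builds a SPARQL protocol GET URL for the first lookup query of the demo. The endpoint address http://example.org/sparql is a placeholder assumed for illustration, since the address of the demo server is not given here. The leading "sparql" keyword and the trailing ";" of the listings are left out, on the assumption that they belong to the interactive SQL client rather than to the query text. The code only constructs the URL and does not contact any server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint address; replace with the real demo server.
ENDPOINT = "http://example.org/sparql"

# Query text copied from the first lookup example, without the
# "sparql" prefix and the closing ";".
QUERY = """select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o))
where
  {
    ?s ?p ?o .
    filter (bif:contains (?o, "'semantic web'"))
  }
limit 10"""


def build_request_url(endpoint, query):
    """Return a GET URL carrying the query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be passed to any HTTP client. Percent-encoding matters here because the query contains quotes, spaces and line breaks.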
The demo is being worked on at the time of submission and may be shown online by appointment.
Automatic annotation of the data based on named entity extraction is
also in progress. By the time of ISWC 2008, the set of sample queries
will be extended with queries based on extracted named entities and
their relationships in the UMBEL and OpenCyc ontologies.
Examples involving owl:sameAs, similarity metrics and search hit scores are also being added.
The Data
The database consists of the Billion Triples Challenge data sets plus some additions such as UMBEL. The Freebase extract is also newer than the one in the original challenge data.
The triple count is 1115 million.
In the case of web harvested resources, the data is loaded in one graph per resource.
In the case of larger data sets like DBpedia or the US census, all triples from the same source share a data-set-specific graph.
All string literals are additionally indexed in a full text index. No stop words are used.
Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.
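To illustrate why several key orders are kept, the sketch below picks, for a triple pattern with some positions bound, the index whose key order starts with the longest run of bound columns. This is only an illustration of the idea of matching a pattern to a key prefix; it is not a description of how the Virtuoso optimizer actually chooses an index, and the function names are invented here.

```python
# The four key orders named in the text.
INDEXES = ["SPOG", "GPOS", "POGS", "OPGS"]


def usable_prefix(index, bound):
    """Count how many leading key columns of the index are bound."""
    n = 0
    for column in index:
        if column not in bound:
            break
        n += 1
    return n


def pick_index(bound):
    """Return the index with the longest bound key prefix.

    Ties go to the index listed first, because max() keeps the
    first maximal element it encounters.
    """
    return max(INDEXES, key=lambda index: usable_prefix(index, bound))


# A pattern such as "?s foaf:nick ?o" binds only the predicate.
print(pick_index({"P"}))
# A lookup of everything about one subject binds only S.
print(pick_index({"S"}))
```

With these four orders, a pattern that binds any single one of S, P, O or G finds an index that leads with that column.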
The Queries
The demo uses Virtuoso SPARQL extensions in most queries. On the one
hand, these extensions consist of well-known SQL features such as
aggregation with grouping, and existence and value subqueries; on
the other, of RDF-specific features.
The latter include run-time RDFS and OWL inferencing support and backward
chaining over subclasses and transitivity.
Simple Lookups
sparql
select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o))
where
{
?s ?p ?o .
filter (bif:contains (?o, "'semantic web'"))
}
limit 10
;
This looks up triples with "semantic web" in the object and produces a search hit summary of the literal,
highlighting the search terms.
sparql
select ?tp count(*)
where
{
?s ?p2 ?o2 .
?o2 a ?tp .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?tp
order by desc 2
limit 40
;
This shows what sorts of things are referenced by the properties of the subject whose foaf:nick is plaid_skirt.
What are these things called?
sparql
select ?lbl count(*)
where
{
?s ?p2 ?o2 .
?o2 rdfs:label ?lbl .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?lbl
order by desc 2
;
Many of these things do not have an rdfs:label. Let us use a more general concept of label
which groups dc:title, foaf:name and other name-like properties together. The subproperties are
resolved at run time; there is no materialization.
sparql
define input:inference 'b3s'
select ?lbl count(*)
where
{
?s ?p2 ?o2 .
?o2 b3s:label ?lbl .
?s foaf:nick ?o .
filter (bif:contains (?o, "plaid_skirt"))
}
group by ?lbl
order by desc 2
;
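The effect of resolving subproperties at query time, instead of storing extra triples, can be sketched in Python as follows. The sample triples, the SUBPROPERTY_OF table and the function names are invented for illustration. Only dc:title and foaf:name are taken from the text as subproperties of b3s:label; the other properties declared in the 'b3s' rule set are not listed here.

```python
# Child property -> parent property. Only these two are named in the text.
SUBPROPERTY_OF = {
    "dc:title": "b3s:label",
    "foaf:name": "b3s:label",
}

# Made-up sample data; none of these carries b3s:label itself.
TRIPLES = [
    ("ex:doc1", "dc:title", "A post about skirts"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:thing", "ex:weight", "12"),
]


def subproperties(prop):
    """Return prop plus everything declared, directly or indirectly, below it."""
    found = {prop}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBPROPERTY_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def match(prop):
    """Answer '?s prop ?o' by expanding prop when the query runs."""
    wanted = subproperties(prop)
    return [(s, o) for s, p, o in TRIPLES if p in wanted]


labels = match("b3s:label")
print(labels)
```

The stored triples are never changed: the expansion happens inside match, so adding a new subproperty declaration takes effect on the next query without reloading data.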
We can list sources by the topics they contain.
Below we look for graphs that mention terrorist bombing.
sparql
select ?g count(*)
where
{
graph ?g
{
?s ?p ?o .
filter (bif:contains (?o, "'terrorist bombing'"))
}
}
group by ?g
order by desc 2
;
Now some Web 2.0 tagging of search results: the tag cloud of "computer".
sparql
select ?lbl count (*)
where
{
?s ?p ?o .
?o bif:contains "computer" .
?s sioc:topic ?tg .
optional
{
?tg rdfs:label ?lbl
}
}
group by ?lbl
order by desc 2
limit 40
;
This query will find the posters who talk the most about sex.
sparql
select ?auth count (*)
where
{
?d dc:creator ?auth .
?d ?p ?o
filter (bif:contains (?o, "sex"))
}
group by ?auth
order by desc 2
;
Analytics
We look for people who are joined by having relatively uncommon interests but do not know each other.
sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2
where
{
{
select ?i count (*) as ?cnt
where
{ ?p foaf:interest ?i }
group by ?i
}
filter ( ?cnt > 1 && ?cnt < 10) .
?p1 foaf:interest ?i .
?p2 foaf:interest ?i .
filter (?p1 != ?p2 &&
!bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &&
!bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
?p1 foaf:nick ?n1 .
?p2 foaf:nick ?n2 .
}
order by ?cnt
limit 50
;
The query takes a fairly long time, mostly spent counting the people interested in each topic across 25M interest triples.
It then takes people who share an interest and checks that neither claims to know the other.
It then sorts the results, rarest interest first. The query can be written more efficiently but is
included here just to show that ad hoc, database-wide scans of the population are possible.
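The three steps just described (count, pair up, exclude acquaintances) can be sketched on a toy data set in Python. All names and data below are invented. Unlike the query, the sketch returns each pair once rather than in both orders, and it skips the foaf:nick lookup.

```python
from collections import Counter
from itertools import combinations

# Made-up (person, interest) pairs. "music" is shared by 12 people.
INTERESTS = [
    ("alice", "falconry"),
    ("bob", "falconry"),
    ("carol", "falconry"),
    ("dave", "yodeling"),
] + [("p%d" % i, "music") for i in range(12)]

# Made-up foaf:knows statements as (subject, object).
KNOWS = {("alice", "bob")}


def uncommon_strangers(interests, knows, low=1, high=10):
    """Pairs sharing an interest held by more than low and fewer than high people."""
    counts = Counter(interest for _, interest in interests)
    people_by_interest = {}
    for person, interest in interests:
        people_by_interest.setdefault(interest, set()).add(person)
    rows = []
    for interest, people in people_by_interest.items():
        cnt = counts[interest]
        if not (low < cnt < high):
            continue
        for p1, p2 in combinations(sorted(people), 2):
            # Drop the pair if either side claims to know the other.
            if (p1, p2) in knows or (p2, p1) in knows:
                continue
            rows.append((interest, cnt, p1, p2))
    rows.sort(key=lambda row: row[1])  # rarest interest first
    return rows


rows = uncommon_strangers(INTERESTS, KNOWS)
print(rows)
```

Here "music" is too common and "yodeling" has a single follower, so only "falconry" survives the count filter.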
Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style
related tags line at the bottom of a search result page. This showcases the use of SQL together
with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is
much more efficiently done in SQL, especially since it gets updated as the data changes.
This is an example of materialized intermediate results based on warehoused RDF.
create table
tag_count (tcn_tag iri_id_8,
tcn_count int,
primary key (tcn_tag));
alter index
tag_count on tag_count partition (tcn_tag int (0hexffff00));
create table
tag_coincidence (tc_t1 iri_id_8,
tc_t2 iri_id_8,
tc_count int,
tc_t1_count int,
tc_t2_count int,
primary key (tc_t1, tc_t2));
alter index
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));
create index
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
How many times is each topic mentioned?
insert into tag_count
select *
from (sparql define output:valmode "LONG"
select ?t count (*) as ?cnt
where
{
?s sioc:topic ?t
}
group by ?t)
xx option (quietcast);
Take all t1, t2 where t1 and t2 are tags of the same subject; store only the permutation where the internal id of t1 is less than that of t2.
insert into tag_coincidence (tc_t1, tc_t2, tc_count)
select "t1", "t2", cnt
from
(select "t1", "t2", count (*) as cnt
from
(sparql define output:valmode "LONG"
select ?t1 ?t2
where
{
?s sioc:topic ?t1 .
?s sioc:topic ?t2
}) tags
where "t1" < "t2"
group by "t1", "t2") xx
where isiri_id ("t1") and
isiri_id ("t2")
option (quietcast);
Now put the individual occurrence counts into the same table with the co-occurrence. This
denormalization makes the related tags lookup faster.
update tag_coincidence
set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.
A single select will return a Technorati-style related tags listing.
To show the URIs of the tags:
select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count
from tag_coincidence
order by tc_count desc;
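For readers who want to see the half-matrix logic in isolation, here is a small Python sketch of the same computation on invented data. It compares tag names as strings, whereas the SQL above compares internal IRI ids; the tags and function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Made-up subjects with their sioc:topic tags.
TAGGED = {
    "post1": {"rdf", "sparql", "sql"},
    "post2": {"rdf", "sparql"},
    "post3": {"rdf"},
}


def tag_counts(tagged):
    """Per-tag occurrence counts, like the tag_count table."""
    return Counter(tag for tags in tagged.values() for tag in tags)


def co_occurrence(tagged):
    """Counts for each unordered tag pair, stored once with t1 < t2."""
    pairs = Counter()
    for tags in tagged.values():
        for t1, t2 in combinations(sorted(tags), 2):
            pairs[(t1, t2)] += 1
    return pairs


def related_rows(tagged):
    """Denormalized rows: (t1, t2, joint count, t1 count, t2 count)."""
    singles = tag_counts(tagged)
    rows = [
        (t1, t2, joint, singles[t1], singles[t2])
        for (t1, t2), joint in co_occurrence(tagged).items()
    ]
    rows.sort(key=lambda row: (-row[2], row[0], row[1]))
    return rows


rows = related_rows(TAGGED)
for row in rows:
    print(row)
```

Keeping only the t1 < t2 half stores each pair once, which halves the table at the cost of having to look a tag up in both columns, hence the second index on (tc_t2, tc_t1) in the SQL version.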
Social Networks
We look at what interests people have.
sparql
select ?o ?cnt
where
{
{
select ?o count (*) as ?cnt
where
{
?s foaf:interest ?o
}
group by ?o
}
filter (?cnt > 100)
}
order by desc 2
limit 100
;
Now the same for the Harry Potter fans.
sparql
select ?i2 count (*)
where
{
?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> .
?p foaf:interest ?i2
}
group by ?i2
order by desc 2
limit 20
;
We check whether foaf:knows relations are symmetrical. We return the top 10 people that others claim to know without being known in return.
sparql
select ?celeb, count (*)
where
{
?claimant foaf:knows ?celeb .
filter (!bif:exists ((select (1)
where
{
?celeb foaf:knows ?claimant
})))
}
group by ?celeb
order by desc 2
limit 10
;
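The asymmetry test can be illustrated on a handful of invented foaf:knows pairs. The sketch below is a plain Python restatement of the query logic, not of how the server evaluates it; the data and the function name are assumptions.

```python
from collections import Counter

# Made-up foaf:knows statements as (claimant, known person).
KNOWS = {
    ("fan1", "star"),
    ("fan2", "star"),
    ("fan3", "star"),
    ("star", "fan3"),  # the only fan that "star" knows back
    ("a", "b"),
    ("b", "a"),
    ("fan1", "b"),
}


def unreciprocated(knows, top=10):
    """Count, per person, the claimants they do not claim to know in return."""
    counts = Counter(
        celeb
        for claimant, celeb in knows
        if (celeb, claimant) not in knows
    )
    return counts.most_common(top)


result = unreciprocated(KNOWS)
print(result)
```

In this sample, "star" is known by three people but reciprocates only once, so two claims remain one-sided.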
We look for a well connected person to start from.
sparql
select ?p count (*)
where
{
?p foaf:knows ?k
}
group by ?p
order by desc 2
limit 50
;
We look for the most connected of the many online identities of Stefan Decker.
sparql
select ?sd count (distinct ?xx)
where
{
?sd a foaf:Person .
?sd ?name ?ns .
filter (bif:contains (?ns, "'Stefan Decker'")) .
?sd foaf:knows ?xx
}
group by ?sd
order by desc 2
;
We count the transitive closure of Stefan Decker's connections.
sparql
select count (*)
where
{
{
select *
where
{
?s foaf:knows ?o
}
}
option (transitive, t_distinct, t_in(?s), t_out(?o)) .
filter (?s = <mailto:stefan.decker@deri.org>)
}
;
Now we do the same while following owl:sameAs links.
sparql
define input:same-as "yes"
select count (*)
where
{
{
select *
where
{
?s foaf:knows ?o
}
}
option (transitive, t_distinct, t_in(?s), t_out(?o)) .
filter (?s = <mailto:stefan.decker@deri.org>)
}
;
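The difference between the two counts can be pictured with a small Python sketch: a breadth-first traversal of foaf:knows edges, run once on the raw identifiers and once after merging identifiers linked by owl:sameAs. The data is invented, and merging identities with a union-find structure is an assumption made for this illustration; it is not a description of how Virtuoso implements the same-as option. In particular, the sketch counts merged identities once, which may differ from what the query counts.

```python
from collections import deque

# Made-up data: two identifiers for the same person, each with its own contacts.
KNOWS = [("sd_mail", "a"), ("a", "b"), ("sd_web", "c"), ("c", "d")]
SAME_AS = [("sd_mail", "sd_web")]


def canonical_map(same_as):
    """Return a function mapping each identifier to one representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right in same_as:
        parent[find(left)] = find(right)
    return find


def reachable(start, knows, same_as=()):
    """Everyone reachable from start over knows edges, optionally via sameAs."""
    find = canonical_map(same_as)
    graph = {}
    for s, o in knows:
        graph.setdefault(find(s), set()).add(find(o))
    seen = set()
    queue = deque([find(start)])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


plain = reachable("sd_mail", KNOWS)
merged = reachable("sd_mail", KNOWS, SAME_AS)
print(len(plain), len(merged))
```

Without the identity links, only the contacts of one identifier are followed; with them, the contacts of every linked identifier join the same closure.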
Demo System
The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions,
each served by a distinct server process. The demonstrated system hosts these 12 servers on 2
machines, each with 2 x Xeon 5345, 16 GB memory and 4 SATA disks. For scaling, the processes
and corresponding partitions can be spread over a larger number of machines. If each process ran
on its own machine with 16 GB RAM, the whole data set could be served from memory. This is desirable for
search engine or fast analytics applications. Most of the demonstrated queries run in memory on
second invocation. The timing difference between first and second run is easily an order of
magnitude.
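The memory figures behind the scaling remark can be worked out with simple arithmetic. The sketch below uses only numbers given in this document (12 partitions, 2 machines, 16 GB per machine, 1115 million triples). Treating 1 GB as 10**9 bytes and all RAM as available to the database are simplifying assumptions, so the per-triple figure is a rough budget covering all indices and the full text index, not a measured storage size.

```python
# Figures taken from the text.
PARTITIONS = 12
MACHINES_NOW = 2
RAM_PER_MACHINE_GB = 16
TRIPLES = 1115 * 10**6

# Assumption: 1 GB = 10**9 bytes, all of it usable by the database.
BYTES_PER_GB = 10**9

current_ram_gb = MACHINES_NOW * RAM_PER_MACHINE_GB
scaled_ram_gb = PARTITIONS * RAM_PER_MACHINE_GB
bytes_per_triple = scaled_ram_gb * BYTES_PER_GB / TRIPLES

print("RAM in the demo setup (GB):", current_ram_gb)
print("RAM with one machine per partition (GB):", scaled_ram_gb)
print("Memory budget per triple (bytes):", round(bytes_per_triple))
```

Spreading the 12 partitions over 12 machines multiplies the available memory by six compared with the two-machine demo setup.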