OpenLink Software's Virtuoso Submission to the Billion Triples Challenge

Source: http://www.openlinksw.com:443/blog/vdb/blog/?id=1446

Introduction

We use Virtuoso 6 Cluster Edition to demonstrate the following:

  * Lookups based on text and structured information
  * Analytics queries
  * Analysis of co-occurrence of features such as interests and tags
  * Dealing with the identity of multiple IRIs (owl:sameAs)

The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension. The demo queries can also be run directly against the SPARQL endpoint.

The demo is being worked on at the time of submission and may be shown online by appointment.

Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008, the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and OpenCyc ontologies. Examples involving owl:sameAs are also being added, as are examples using similarity metrics and search hit scores.

The Data

The database consists of the Billion Triples Challenge data sets plus some additions such as UMBEL. The Freebase extract is also newer than the one in the original challenge data. The triple count is 1115 million.

For web-harvested resources, the data is loaded in one graph per resource. For larger data sets such as DBpedia or the US census, all triples of the same provenance share a data-set-specific graph.

All string literals are additionally indexed in a full text index. No stop words are used.

Most queries do not specify a graph, so they are evaluated against the union of all the graphs in the database.

The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices.

The Queries

The demo uses Virtuoso SPARQL extensions in most queries.
These extensions consist, on the one hand, of well-known SQL features such as aggregation with grouping and existence and value subqueries and, on the other, of RDF-specific features. The latter include run-time RDFS and OWL inferencing support and backward chaining over subclasses and transitivity.

Simple Lookups

    sparql
    select ?s ?p (bif:search_excerpt (bif:vector ('semantic', 'web'), ?o))
    where
      {
        ?s ?p ?o .
        filter (bif:contains (?o, "'semantic web'"))
      }
    limit 10
    ;

This looks up triples with "semantic web" in the object and produces a search hit summary of the literal, highlighting the search terms.

    sparql
    select ?tp count(*)
    where
      {
        ?s ?p2 ?o2 .
        ?o2 a ?tp .
        ?s foaf:nick ?o .
        filter (bif:contains (?o, "plaid_skirt"))
      }
    group by ?tp
    order by desc 2
    limit 40
    ;

This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.

What are these things called?

    sparql
    select ?lbl count(*)
    where
      {
        ?s ?p2 ?o2 .
        ?o2 rdfs:label ?lbl .
        ?s foaf:nick ?o .
        filter (bif:contains (?o, "plaid_skirt"))
      }
    group by ?lbl
    order by desc 2
    ;

Many of these things do not have an rdfs:label. Let us use a more general concept of label, which groups dc:title, foaf:name and other name-like properties together. The subproperties are resolved at run time; there is no materialization.

    sparql
    define input:inference 'b3s'
    select ?lbl count(*)
    where
      {
        ?s ?p2 ?o2 .
        ?o2 b3s:label ?lbl .
        ?s foaf:nick ?o .
        filter (bif:contains (?o, "plaid_skirt"))
      }
    group by ?lbl
    order by desc 2
    ;

We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing.

    sparql
    select ?g count(*)
    where
      {
        graph ?g
          {
            ?s ?p ?o .
            filter (bif:contains (?o, "'terrorist bombing'"))
          }
      }
    group by ?g
    order by desc 2
    ;

Now some web 2.0 tagging of search results. The tag cloud of "computer":

    sparql
    select ?lbl count (*)
    where
      {
        ?s ?p ?o .
        ?o bif:contains "computer" .
        ?s sioc:topic ?tg .
        optional { ?tg rdfs:label ?lbl }
      }
    group by ?lbl
    order by desc 2
    limit 40
    ;

This query finds the posters who talk the most about sex.

    sparql
    select ?auth count (*)
    where
      {
        ?d dc:creator ?auth .
        ?d ?p ?o
        filter (bif:contains (?o, "sex"))
      }
    group by ?auth
    order by desc 2
    ;

Analytics

We look for people who share relatively uncommon interests but do not know each other.

    sparql
    select ?i ?cnt ?n1 ?n2 ?p1 ?p2
    where
      {
        { select ?i count (*) as ?cnt
          where { ?p foaf:interest ?i }
          group by ?i
        }
        filter ( ?cnt > 1 && ?cnt < 10) .
        ?p1 foaf:interest ?i .
        ?p2 foaf:interest ?i .
        filter (?p1 != ?p2
                && !bif:exists ((select (1) where {?p1 foaf:knows ?p2 }))
                && !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
        ?p1 foaf:nick ?n1 .
        ?p2 foaf:nick ?n2 .
      }
    order by ?cnt
    limit 50
    ;

The query takes a fairly long time, most of it spent computing the per-interest counts over 25M interest triples. It then takes pairs of people who share an interest and checks that neither claims to know the other. Finally, it sorts the results with the rarest interest first.

The query can be written more efficiently, but it is here just to show that database-wide scans of the population are possible ad hoc.

Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF.
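As a rough illustration of what the half-matrix holds, here is a minimal in-memory sketch in Python. It is not how Virtuoso computes the matrix. The function name `tag_cooccurrence` and the sample data are hypothetical, and plain string ordering stands in for the ordering on internal IRI ids that the SQL relies on.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(subject_tags):
    """Count tag occurrences and unordered tag-pair co-occurrences.

    subject_tags maps a subject to the list of tags (sioc:topic values)
    attached to it. Only the pair with t1 < t2 is stored, so the result
    is a half-matrix. String order is an assumption standing in for the
    internal IRI id order used in the database.
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in subject_tags.values():
        unique = sorted(set(tags))  # assumption: one count per subject and tag
        tag_count.update(unique)
        # combinations() over a sorted list yields each pair once, t1 < t2
        pair_count.update(combinations(unique, 2))
    return tag_count, pair_count


# Hypothetical sample data, not taken from the challenge data set.
sample = {
    "post1": ["rdf", "sparql", "sql"],
    "post2": ["rdf", "sparql"],
    "post3": ["sql"],
}
tag_count, pair_count = tag_cooccurrence(sample)
```

Storing only one ordering of each pair halves the size of the table; a lookup for a given tag then has to consider both columns.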
    create table tag_count (tcn_tag iri_id_8, tcn_count int, primary key (tcn_tag));
    alter index tag_count on tag_count partition (tcn_tag int (0hexffff00));

    create table tag_coincidence (tc_t1 iri_id_8, tc_t2 iri_id_8, tc_count int,
      tc_t1_count int, tc_t2_count int, primary key (tc_t1, tc_t2));
    alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

    create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));

How many times is each topic mentioned?

    insert into tag_count
      select *
        from (sparql define output:valmode "LONG"
              select ?t count (*) as ?cnt
              where { ?s sioc:topic ?t }
              group by ?t) xx
      option (quietcast);

Take all t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 is less than that of t2.

    insert into tag_coincidence (tc_t1, tc_t2, tc_count)
      select "t1", "t2", cnt
        from (select "t1", "t2", count (*) as cnt
                from (sparql define output:valmode "LONG"
                      select ?t1 ?t2
                      where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags
               where "t1" < "t2"
               group by "t1", "t2") xx
       where isiri_id ("t1") and isiri_id ("t2")
      option (quietcast);

Now put the individual occurrence counts into the same table as the co-occurrence counts. This denormalization makes the related tags lookup faster.

    update tag_coincidence
       set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
           tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);

Now each tag_coincidence row has the joint occurrence count and the individual occurrence counts. A single select will return a Technorati-style related tags listing.
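To make the related tags lookup concrete, the following Python sketch works over rows shaped like the tag_coincidence table. The function `related_tags` and the sample rows are hypothetical illustrations, not part of the demo. Because only the t1 < t2 permutation is stored, the tag of interest may sit in either column, which is presumably why the table also carries an index led by tc_t2.

```python
def related_tags(rows, tag, top=10):
    """Return up to `top` tags co-occurring with `tag`, most frequent first.

    rows: iterable of (t1, t2, pair_count, t1_count, t2_count) tuples,
    mirroring the tag_coincidence columns. Each result is
    (other_tag, pair_count, other_tag_count).
    """
    hits = []
    for t1, t2, pair_count, t1_count, t2_count in rows:
        if t1 == tag:
            hits.append((t2, pair_count, t2_count))
        elif t2 == tag:
            hits.append((t1, pair_count, t1_count))
    # Rank by joint occurrence count, highest first.
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:top]


# Hypothetical rows; the counts are made up for illustration.
rows = [
    ("rdf", "sparql", 40, 100, 60),
    ("rdf", "sql", 5, 100, 300),
    ("sparql", "sql", 12, 60, 300),
]
result = related_tags(rows, "sparql")
```

Having the individual counts on the same row means a ranking that normalizes by tag popularity could be computed without a further join.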
To show the URIs of the tags:

    select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count
      from tag_coincidence
     order by tc_count desc;

Social Networks

We look at what interests people have.

    sparql
    select ?o ?cnt
    where
      {
        { select ?o count (*) as ?cnt
          where { ?s foaf:interest ?o }
          group by ?o
        }
        filter (?cnt > 100)
      }
    order by desc 2
    limit 100
    ;

Now the same for the Harry Potter fans.

    sparql
    select ?i2 count (*)
    where
      {
        ?p foaf:interest <http://www.livejournal.com/interests.bml?int=harry+potter> .
        ?p foaf:interest ?i2
      }
    group by ?i2
    order by desc 2
    limit 20
    ;

We see whether foaf:knows relations are symmetrical. We return the top n people that others claim to know without being known in return.

    sparql
    select ?celeb, count (*)
    where
      {
        ?claimant foaf:knows ?celeb .
        filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant })))
      }
    group by ?celeb
    order by desc 2
    limit 10
    ;

We look for a well-connected person to start from.

    sparql
    select ?p count (*)
    where { ?p foaf:knows ?k }
    group by ?p
    order by desc 2
    limit 50
    ;

We look for the most connected of the many online identities of Stefan Decker.

    sparql
    select ?sd count (distinct ?xx)
    where
      {
        ?sd a foaf:Person .
        ?sd ?name ?ns .
        filter (bif:contains (?ns, "'Stefan Decker'")) .
        ?sd foaf:knows ?xx
      }
    group by ?sd
    order by desc 2
    ;

We count the transitive closure of Stefan Decker's connections.

    sparql
    select count (*)
    where
      {
        { select * where { ?s foaf:knows ?o } }
        option (transitive, t_distinct, t_in(?s), t_out(?o)) .
        filter (?s = <mailto:stefan.decker@deri.org>)
      }
    ;

Now we do the same while following owl:sameAs links.

    sparql
    define input:same-as "yes"
    select count (*)
    where
      {
        { select * where { ?s foaf:knows ?o } }
        option (transitive, t_distinct, t_in(?s), t_out(?o)) .
        filter (?s = <mailto:stefan.decker@deri.org>)
      }
    ;

Demo System

The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process.
The demonstrated system hosts these 12 servers on 2 machines, each with 2 Xeon 5345 processors, 16GB of memory, and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each process ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. The timing difference between the first and second run is easily an order of magnitude.

Posted: Tue, 30 Sep 2008 16:24:34 GMT
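The memory remark can be checked with back-of-the-envelope arithmetic. The Python sketch below uses only figures from the text (1115 million triples, 12 partitions, 2 machines, 16GB each). It assumes 1GB = 10^9 bytes and, unrealistically, that all RAM is available to the database; the resulting per-triple budgets are derived here for illustration and are not measurements from the demo system.

```python
TRIPLES = 1115 * 10**6    # triple count stated in the text
PARTITIONS = 12           # one server process per partition
MACHINES_IN_DEMO = 2
RAM_PER_MACHINE_GB = 16

# Aggregate memory in the demonstrated setup and in the scaled-out
# setup where each partition gets its own 16GB server.
demo_ram_gb = MACHINES_IN_DEMO * RAM_PER_MACHINE_GB
scaled_ram_gb = PARTITIONS * RAM_PER_MACHINE_GB


def bytes_per_triple(ram_gb, triples=TRIPLES):
    """RAM budget per triple, assuming 1GB = 10**9 bytes and that all
    memory is usable for data (both are simplifying assumptions)."""
    return ram_gb * 10**9 / triples


demo_budget = bytes_per_triple(demo_ram_gb)
scaled_budget = bytes_per_triple(scaled_ram_gb)
```

The sixfold increase in aggregate memory is what separates a working set that partly spills to disk from one that can stay resident across all four indices plus the full text index.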