Personal Details

OpenLink Software
Burlington, United States

Subscribe

Post Categories

Recent Articles

Community Member Blogs

Display Settings

articles per page.
order.

Translate

Showing posts in all categories RefreshRefresh
Dbpedia Benchmark Revisited [ Virtuso Data Space Bot ]
Dbpedia Benchmark Revisited
We ran the Dbpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries.
Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full Dbpedia 3, all languages with about 198M triples on Virtuoso 5 and 6 cluster, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The cluster configuration was with 4 processes in the same box.
We ran the queries in two variants: The first was with graph specified in the SPARQL from clause, using the default indices. The second variant was with no graph specified anywhere, using an alternate indexing scheme.
The times below are for the sequence of 5 queries, individual query times are not reported. I did not do a line by line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, specially for the numeric range conditions but we will do this when somebody comes up with better times.
First about 5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With 6 it sort of can because 6 is more space efficient.
So we added the index:
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);

5 with gspo, ogps, pogs 
cold   210s
warm  0.600s

6 cluster with gspo, ogps 
cold  136s
warm 4.01 s


6 cluster with gspo, ogpps, pogs
cold  33.4s
warm 0.628 s
OK, so now let us do it without a graph being specified. For all platforms, we do:
- Drop any existing indices.
create table r2 (g iri_id_8, s, iri_id_8, p iri_id_8, o any, primary key (s, p, o, g))
alter index R2 on R2 partition (s int (0hexffff00));

log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;

drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
The code is identical for 5 and 6, except that with 5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of id's with 5 around a few billion triples, so with 6 we have double the id length and still manage to be vastly more space efficient.
With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This is about 60% of the space of a non-bitmap index for data such as Dbpedia.
If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.
5 with spog, pogs, opgs, gpos 
warm 0.595 s

6 cluster  with spog, pogs, opgs, gpos 
warm 0.617 s
The cold times were about the same as above, so not reproduced.
Graph or No Graph?
It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.
On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant a different setup of the database, simple as that.
The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand stores that work with one quad table, which is most of the ones out there should do it maybe with some configuring, as shown above.
Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.
Once the indices are right, there is no difference between specifying a graph and no graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with graph missing. For example, bitmap intersections are possible only when all leading key parts are given.

Conclusions

The best warm cache time is with 5, the five queries under 600 ms after the first go. This is to show that all in memory with a single thread of execution is hard to beat.
The 6 cluster performs the same in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, 6 cluster leaves 5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. 6 has substantially better data layout on disk, as well as fewer pages to load for the same content.
This makes it possible to run the queries without the pogs index on 6 even when 5 took prohibitively long.
The morale of the story is to have a lot of RAM and space efficient data representation.
The Dbpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic.
Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.

Appendix: Query Text

For reference, the query text is below, with graph given. To run without specifying the graph, just drop the from <http://dbpedia.org>. The returned row counts are indicated below the query text.
sparql SELECT ?p ?o from <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };

-- 1337 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
from <http://dbpedia.org> WHERE {
  ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director from <http://dbpedia.org> 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage from <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
   <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?berlinLat + 0.03190235436 &&
     ?long       >=     ?berlinLong - 0.08679199218 &&
     ?lat        >=     ?berlinLat - 0.03190235436 && 
     ?long       <=     ?berlinLong + 0.08679199218) };

-- 56 rows


sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage from <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
   <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?nyLat + 0.3190235436 &&
     ?long       >=     ?nyLong - 0.8679199218 &&
     ?lat        >=     ?nyLat - 0.3190235436 && 
     ?long       <=     ?nyLong + 0.8679199218) };

-- 13 rows
# PermaLink Comments [0]
05/09/2008 15:33 GMT Modified: 05/09/2008 15:33 GMT
Dbpedia Benchmark Revisited [ Orri Erling ]
We ran the Dbpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries.
Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full Dbpedia 3, all languages with about 198M triples on Virtuoso 5 and 6 cluster, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The cluster configuration was with 4 processes in the same box.
We ran the queries in two variants: The first was with graph specified in the SPARQL from clause, using the default indices. The second variant was with no graph specified anywhere, using an alternate indexing scheme.
The times below are for the sequence of 5 queries, individual query times are not reported. I did not do a line by line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, specially for the numeric range conditions but we will do this when somebody comes up with better times.
First about 5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With 6 it sort of can because 6 is more space efficient.
So we added the index:
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);

5 with gspo, ogps, pogs 
cold   210s
warm  0.600s

6 cluster with gspo, ogps 
cold  136s
warm 4.01 s


6 cluster with gspo, ogpps, pogs
cold  33.4s
warm 0.628 s
OK, so now let us do it without a graph being specified. For all platforms, we do:
- Drop any existing indices.
create table r2 (g iri_id_8, s, iri_id_8, p iri_id_8, o any, primary key (s, p, o, g))
alter index R2 on R2 partition (s int (0hexffff00));

log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;

drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
The code is identical for 5 and 6, except that with 5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of id's with 5 around a few billion triples, so with 6 we have double the id length and still manage to be vastly more space efficient.
With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This is about 60% of the space of a non-bitmap index for data such as Dbpedia.
If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.
5 with spog, pogs, opgs, gpos 
warm 0.595 s

6 cluster  with spog, pogs, opgs, gpos 
warm 0.617 s
The cold times were about the same as above, so not reproduced.
Graph or No Graph?
It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.
On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant a different setup of the database, simple as that.
The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand stores that work with one quad table, which is most of the ones out there should do it maybe with some configuring, as shown above.
Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.
Once the indices are right, there is no difference between specifying a graph and no graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with graph missing. For example, bitmap intersections are possible only when all leading key parts are given.

Conclusions

The best warm cache time is with 5, the five queries under 600 ms after the first go. This is to show that all in memory with a single thread of execution is hard to beat.
The 6 cluster performs the same in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, 6 cluster leaves 5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. 6 has substantially better data layout on disk, as well as fewer pages to load for the same content.
This makes it possible to run the queries without the pogs index on 6 even when 5 took prohibitively long.
The morale of the story is to have a lot of RAM and space efficient data representation.
The Dbpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic.
Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.

Appendix: Query Text

For reference, the query text is below, with graph given. To run without specifying the graph, just drop the from <http://dbpedia.org>. The returned row counts are indicated below the query text.
sparql SELECT ?p ?o from <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };

-- 1337 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
from <http://dbpedia.org> WHERE {
  ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director from <http://dbpedia.org> 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage from <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
   <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?berlinLat + 0.03190235436 &&
     ?long       >=     ?berlinLong - 0.08679199218 &&
     ?lat        >=     ?berlinLat - 0.03190235436 && 
     ?long       <=     ?berlinLong + 0.08679199218) };

-- 56 rows


sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage from <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
   <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?nyLat + 0.3190235436 &&
     ?long       >=     ?nyLong - 0.8679199218 &&
     ?lat        >=     ?nyLat - 0.3190235436 && 
     ?long       <=     ?nyLong + 0.8679199218) };

-- 13 rows
# PermaLink Comments [0]
05/09/2008 19:27 GMT Modified: 05/09/2008 15:33 GMT
Comments about recent Semantic Gang Podcast [ Kingsley Uyi Idehen ]

After listening to the latest Semantic Web Gang podcast, I found myself agreeing with some of the points made by Alex Iskold, specifically:

    -- Business exploitation of Linked Data on the Web will certainly be driven by the correlation of opportunity costs (which is more than likely what Alex meant by "use cases") associated with the lack of URIs originating from the domain of a given business (Tom Heath: also effectively alluded to this via his BBC and URI land grab anecdotes; same applies Georgi's examples)
    -- History is a great tutor, answers to many of today's problems always lie somewhere in plain sight of the past.

Of course, I also believe that Linked Data serves Web Data Integration across the Internet very well too, and the fact that it will be beneficial to businesses in a big way. No individual or organization is an island, I think the Internet and Web have done a good job of demonstrating that thus far :-) We're all data nodes in a Giant Global Graph.

Daniel lewis did shed light on the read-write aspects of the Linked Data Web, which is actually very close to the callout for a Wikipedia for Data. TimBL has been working on this via Tabulator (see Tabulator Editing Screencast), Bengamin Nowack also added similar functionality to ARC, and of course we support the same SPARQL UPDATE into an RDF information resource via the RDF Sink feature of our WebDAV and ODS-Briefcase implementations.

# PermaLink Comments [0]
05/02/2008 21:44 GMT Modified: 05/05/2008 20:06 GMT
In Perpetual Pursuit of Context [ Kingsley Uyi Idehen ]

I've always been of the opinion that concise value proposition articulation shouldn't be the achilles of the Semantic Web. As the Linked Data wave climbs up the "value Appreciation and Comprehension chain", it's getting clearer by the second that "Context" is a point of confluence for Semantic Web Technologies and easy to comprehend value, from the perspectives of those outside the core community.

In today's primarily Document centric Web, the pursuit of Context is akin to pursuing a mirage in a desert of user generated content. The quest is labor intensive, and you ultimaely end up without water at the end of the pursuit :-)

Listening to the Christine Connor's podcast interview with Talis simply reinforces my strong belief that "Context, Context, Context" is the Semantic Web's equivalent of Real Estate's "Location, Location, Location" (ignore the subprime loans mess for now). The critical thing to note is that you cannot unravel "Context" from existing Web content without incorporating powerful disambiguation technology into an "Entity Extraction" process. Of course, you cannot even consider seriously pursing any entity extraction and disambiguation endeavor without a lookup backbone that exposes "Named Entities" and their relationships to "Subject matter Concepts" (BTW - this is what UMBEL is all about). Thus, when looking at the broad subject of the Semantic Web, we can also look at "Context" as the vital point of confluence for the Data oriented (Linked Data) and the "Linguistic Meaning" oriented perspectives.

I am even inclined to state publicly that "Context" may ultimately be the foundation for 4th "Web Interaction Dimension" where practical use of AI leverages a Linked Data Web substrate en route to exposing new kinds of value :-)

"Context" may also be the focal point of concise value proposition articulation to VCs as in: "My solution offers the ability to discover and exploit "Context" iteratively, at the rate of $X.XX per iteration, across a variety of market segments :-)

# PermaLink Comments [3]
05/02/2008 19:18 GMT Modified: 05/03/2008 15:07 GMT
XTech Talks covering Linked Data [ Kingsley Uyi Idehen ]

Courtesy a post by Chris Bizer to the LOD community mailing list, here is a list of Linked Data oriented talks at the upcoming XTech 2008 event (also see the XTech 2008 Schedule which is Linked Data friendly). Of course, I am posting this to my Blog Data Space with the sole purpose of adding data to the rapidly growing Giant Global Graph of Linked Data, basically adding to my collection of live Linked Data utility demos :-)

Here is the list:

  1. Linked Data Deployment (Daniel Lewis, OpenLink Software)
  2. The Programmes Ontology (Tom Scott, BBC and all)
  3. SemWebbing the London Gazette (Jeni Tennison, The Stationery Office)
  4. Searching, publishing and remixing a Web of Semantic Data (Richard Cyganiak, DERI Galway)
  5. Building a Semantic Web Search Engine: Challenges and Solutions (Aidan Hogan, DERI Galway)
  6. 'That's not what you said yesterday!' - evolving your Web API (Ian Davis, Talis)
  7. Representing, indexing and mining scientific data using XML and RDF: Golem and CrystalEye (Andrew Walkingshaw, University of Cambridge)

For the time challenged (i.e. those unable to view this post using it's permalink / URI as a data source via the OpenLink RDF Browser, Zitgist Data Viewer, DISCO Hyperdata Browser, or Tabulator), the benefits of this post are as follows:

  • automatic URI generation for all linked items in this post
  • automatic propagation of tags to del.icio.us, Technorati, and PingTheSemanticWeb
  • automatic association of formal meanings to my Tags using the MOAT Ontology
  • automatic collation and generation of statistical data about my tags using the SCOT Ontology (*missing link is a callout to SCOT Tag Ontology folks to sort the project's home page URL at the very least*)
  • explicit typing of my Tags as SKOS Concepts.

Put differently, I cost-effectively contribute to the GGG across all Web interaction dimensions (1.0, 2.0, 3.0) :-)

# PermaLink Comments [0]
05/02/2008 14:53 GMT Modified: 05/05/2008 17:07 GMT
SPARQL at WWW 2008 [ Virtuso Data Space Bot ]
SPARQL at WWW 2008

Andy Seaborne and Eric Prud'hommeaux, editors of the SPARQL recommendation, convened a SPARQL birds of a feather session at WWW 2008. The administrative outcome was that implementors could now experiment with extensions, hopefully keeping each other current about their efforts and that towards the end of 2008, a new W3C working group might begin formalizing the experiences into a new SPARQL spec.

The session drew a good crowd, including many users and developers. The wishes were largely as expected, with a few new ones added. Many of the wishes already had diverse implementations, however most often without interop. I will below give some comments on the main issues discussed.

  • SPARQL Update - This is likely the most universally agreed upon extension. Implementations exist, largely along the lines of Andy Seaborne's SPARUL spec, which is also likely material for a W3C member submission. The issue is without much controversy; transactions fall outside the scope, which is reasonable enough. With triple stores, we can define things as combinations of inserts and deletes, and isolation we just leave aside. If anything, operating on a transactional platform such as Virtuoso, one wishes to disable transactions for any operations such as bulk loads and long-running inserts and deletes. Transactionality has pretty much no overhead for a few hundred rows, but for a few hundred million rows the cost of locking and rollback is prohibitive. With Virtuoso, we have a row auto-commit mode which we recommend for use with RDF: It commits by itself now and then, optionally keeping a roll forward log, and is transactional enough not to leave half triples around, i.e., inserted in one index but not another.

    As far as we are concerned, updating physical triples along the SPARUL lines is pretty much a done deal.

    The matter of updating relational data mapped to RDF is a whole other kettle of fish. On this, I should say that RDF has no special virtues for expressing transactions but rather has a special genius for integration. Updating is best left to web service interfaces that use SQL on the inside. Anyway, updating union views, which most mappings will be, is complicated. Besides, for transactions, one usually knows exactly what one wishes to update.

  • Full Text - Many people expressed a desire for full text access. Here we run into a deplorable confusion with regexps. The closest SPARQL has to full text in its native form is regexps, but these are not really mappable to full text except in rare special cases and I would despair of explaining to an end user what exactly these cases are. So, in principle, some regexps are equivalent to full text but in practice I find it much preferable to keep these entirely separate.

    It was noted that what the users want is a text box for search words. This is a front end to the CONTAINS predicate of most SQL implementations. Ours is MS SQL Server compatible and has a SPARQL version called bif:contains. One must still declare which triples one wants indexed for full text, though. This admin overhead seems inevitable, as text indexing is a large overhead and not needed by all applications.

    Also, text hits are not boolean; usually they come with a hit score. Thus, a SPARQL extension for this could look like

    select * where { ?thing has_description ?d . ?d ftcontains "gizmo" ftand "widget" score ?score . }

    This would return all the subjects, descriptions, and scores, from subjects with a has_description property containing widget and gizmo. Extending the basic pattern is better than having the match in a filter, since the match binds a variable.

    The XQuery/XPath groups have recently come up with a full-text spec, so I used their style of syntax above. We already have a full-text extension, as do some others. but for standardization, it is probably most appropriate to take the XQuery work as a basis. The XQuery full-text spec is quite complex, but I would expect most uses to get by with a small subset, and the structure seems better thought out, at first glance, than the more ad-hoc implementations in diverse SQLs.

    Again, declaring any text index to support the search, as well as its timeliness or transactionality, are best left to implementations.

  • Federation - This is a tricky matter. ARQ has a SPARQL extension for sending a nested set of triple patterns to a specific end-point. The DARQ project has something more, including a selectivity model for SPARQL.

    With federated SQL, life is simpler since after the views are expanded, we have a query where each table is at a known server and has more or less known statistics. Generally, execution plans where as much work as possible is pushed to the remote servers are preferred, and modeling the latencies is not overly hard. With SPARQL, each triple pattern could in principle come from any of the federated servers. Associating a specific end-point to a fragment of the query just passes the problem to the user. It is my guess that this is the best we can do without getting very elaborate, and possibly buggy, end-point content descriptions for routing federated queries.

    Having said this, there remains the problem of join order. I suggested that we enhance the protocol by allowing asking an end-point for the query cost for a given SPARQL query. Since they all must have a cost model for optimization, this should not be an impossible request. A time cost and estimated cardinality would be enough. Making statistics available à la DARQ was also discussed. Being able to declare cardinalities expected of a remote end-point is probably necessary anyway, since not all will implement the cost model interface. For standardization, agreeing on what is a proper description of content and cardinality and how fine grained this must be will be so difficult that I would not wait for it. A cost model interface would nicely hide this within the end-point itself.

    With Virtuoso, we do not have a federated SPARQL scheme but we could have the ARQ-like service construct. We'd use our own cost model with explicit declarations of cardinalities of the remote data for guessing a join order. Still, this is a bit of work. We'll see.

    For practicality, the service construct coupled with join order hints is the best short term bet. Making this pretty enough for standardization is not self-evident, as it requires end-point description and/or cost model hooks for things to stay declarative.

  • End-point description - This question has been around for a while; I have blogged about it earlier, but we are not really at a point where there would be even rough consensus about an end-point ontology. We should probably do something on our own to demonstrate some application of this, as we host lots of linked open data sets.

  • SQL equivalence - There were many requests for aggregation, some for subqueries and nesting, expressions in select, negation, existence and so on. I would call these all SQL equivalence. One use case was taking all the teams in the database and for all with over 5 members, add the big_team class and a property for member count.

    With Virtuoso, we could write this as --

    construct { ?team a big_team . ?team member_count ?ct } from ... where {?team a team . { select ?team2 count (*) as ?ct where { ?m member_of ?team2 } . filter (?team = ?team2 and ? ct > 5) }}

    We have pretty much all the SQL equivalence features, as we have been working for some time at translating the TPC-H workload into SPARQL.

    The usefulness of these things is uncontested but standardization could be hard as there are subtle questions about variable scope and the like.

  • Inference - The SPARQL spec does not deal with transitivity or such matters because it is assumed that these are handled by an underlying inference layer. This is however most often not so. There was interest in more fine grained control of inference, for example declaring that just one property in a query would be transitive or that subclasses should be taken into account in only one triple pattern. As far as I am concerned, this is very reasonable, and we even offer extensions for this sort of thing in Virtuoso's SPARQL. This however only makes sense if the inference is done at query time and pattern by pattern. For instance, if forward chaining is used, this no longer makes sense. Specifying that some forward chaining ought to be done at query time is impractical, as the operation can be very large and time consuming and it is the DBA's task to determine what should be stored and for how long, how changes should be propagated, and so on. All these are application dependent and standardizing will be difficult.

    Support for RDF features like lists and bags would all fall into the functions an underlying inference layer should perform. These things are of special interest when querying OWL models, for example.

  • Path expressions - Path expressions were requested by a few people. We have implemented some, as in

    ?product+?has_supplier+>s_name = "Gizmos, Inc.".
    This means that one supplier of product has name "Gizmo, Inc.". This is a nice shorthand but we run into problems if we start supporting repetitive steps, optional steps, and the like.
  • In conclusion, update, full text, and basic counting and grouping would seem straightforward at this point. Nesting queries, value subqueries, views, and the like should not be too hard if an agreement is reached on scope rules. Inference and federation will probably need more experimentation but a lot can be had already with very simple fine grained control of backward chaining, if such applies, or with explicit end-point references and explicit join order. These are practical but not pretty enough for committee consensus, would be my guess. Anyway, it will be a few months before anything formal will happen.

    # PermaLink Comments [0]
    04/30/2008 12:28 GMT Modified: 04/30/2008 13:48 GMT
    SPARQL at WWW 2008 [ Orri Erling ]
    Andy Seaborne and Eric Prud'hommeaux, editors of the SPARQL recommendation, convened a SPARQL birds of a feather session at WWW 2008. The administrative outcome was that implementors could now experiement with extensions, hopefully keeping each other current about their efforts and that towards the end of 2008, a new W3C working group might begin formalizing the experiences into a new SPARQL spec.
    The session drew a good crowd, including many users and developers. The wishes were largely as expected, with a few new ones added. Many of the wishes already had diverse implementations, however most often without interop. I will below give some comments on the main issues discussed.
    - SPARQL Update - This is likely the most universally agreed upon extension. Implementations exist, largely along the lines of Andy Seaborne's SPARUL spec, which is also likely material for a W3C member submission. The issue is without much controverse, transactions fall outside the scope, which is reasonable enough. With triple stores, we can define things as combinations of inserts and deletes and isolation we just leave aside. If anything, operating on a transactional platform such as Virtuoso, one wishes to disable transactions for any operations such as bulk loads and long running inserts and deletes. Transactionality has pretty much no overhead for a few hundred rows but for a few hundred million rows the cost of locking and rollback is prohibitive. With Virtuoso, we have a row autocommit mode which we recommend for use with RDF: It commits by itself now and then, optionally keeping a roll forward log and is transactional enough not to leave half triples around,i.e. inserted in one index but not another.
    As far as we are concerned, updating physical triples along the SPARUL lines is pretty much a done deal.
    The matter of updating relational data mapped to RDF is a whole other kettle of fish. On this, I should say that RDF has no special virtues for expressing transactions but rather has a special genius for integration. Updating is best left to web service interfaces that use SQL on the inside. Anyway, updating union views, which most mappings will be, is complicated. Besides, for transactions, one usually knows exactly what one wishes to update.
    Full Text - Many people expressed a desire for full text access. Here we run into a deplorable confusion with regexps. The closest SPARQL has to full text in its native form is regexps, but these are not really mappable to full text except in rare special cases and I would despair of explaining to an end user what exactly