Andy Seaborne and Eric Prud'hommeaux, editors of the SPARQL
recommendation, convened a SPARQL birds of a feather session at WWW
2008. The administrative outcome was that implementors could now
experiment with extensions, hopefully keeping each other current
about their efforts, and that towards the end of 2008 a new W3C
working group might begin formalizing the experiences into a new
SPARQL spec.
The session drew a good crowd, including many users and developers.
The wishes were largely as expected, with a few new ones added. Many
of the wishes already had diverse implementations, though most often
without interop. Below I give some comments on the main issues discussed.
- SPARQL Update - This is likely the most universally agreed upon
extension. Implementations exist, largely along the lines of Andy
Seaborne's SPARUL spec, which is also likely material for a W3C
member submission. The issue is without much controversy;
transactions fall outside the scope, which is reasonable enough.
With triple stores, we can define things as combinations of inserts
and deletes, and isolation we just leave aside. If anything,
operating on a transactional platform such as Virtuoso, one wishes
to disable transactions for operations such as bulk loads and
long running inserts and deletes. Transactionality has pretty much
no overhead for a few hundred rows, but for a few hundred million
rows the cost of locking and rollback is prohibitive. With
Virtuoso, we have a row autocommit mode which we recommend for use
with RDF: it commits by itself now and then, optionally keeping a
roll forward log, and is transactional enough not to leave half
triples around, i.e. inserted in one index but not another.
As far as we are concerned, updating physical triples along the SPARUL lines is pretty much a done deal.
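To make the idea of committing "now and then" during a long load concrete, here is a minimal sketch. It is not Virtuoso's row autocommit mode; it uses Python's standard sqlite3 module as a stand-in store, and the table layout, the `make_store` and `bulk_load` names and the batch size are all assumptions made for illustration.

```python
import sqlite3


def make_store():
    """Create an in-memory stand-in for a triple table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.commit()
    return conn


def bulk_load(conn, triples, batch_size=1000):
    """Insert triples, committing every batch_size rows.

    Returns the number of commits issued. Each triple is written by a single
    INSERT, so a commit never falls in the middle of a row.
    """
    cur = conn.cursor()
    commits = 0
    pending = 0
    for s, p, o in triples:
        cur.execute("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", (s, p, o))
        pending += 1
        if pending >= batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit whatever is left over after the last full batch.
        conn.commit()
        commits += 1
    return commits
```

The point of the batching is only that no single transaction grows with the size of the load; what is lost is the ability to roll the whole load back as one unit.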
The matter of updating relational data mapped to RDF is a whole other
kettle of fish. On this, I should say that RDF has no special virtues
for expressing transactions but rather has a special genius for
integration. Updating is best left to web service interfaces that use
SQL on the inside. Anyway, updating union views, which most mappings will be, is complicated. Besides, for transactions, one usually knows exactly what one wishes to update.
- Full Text - Many people expressed a desire for full text access. Here
we run into a deplorable confusion with regexps. The closest SPARQL
has to full text in its native form is regexps, but these are not
really mappable to full text except in rare special cases and I would
despair of explaining to an end user what exactly these cases are.
So, in principle, some regexps are equivalent to full text but in practice I find it much preferable to keep these entirely separate.
It was noted that what the users want is a text box for search words.
This is a front end to the CONTAINS predicate of most SQL
implementations. Ours is MS SQL Server compatible and has a SPARQL
version called bif:contains. One must still declare which triples one
wants indexed for full text, though. This admin overhead seems
inevitable, as text indexing is a large overhead and not needed by all
applications.
Also, text hits are not boolean; usually they come with a hit score. Thus, a SPARQL extension for this could look like
select * where { ?thing has_description ?d . ?d ftcontains "gizmo" ftand "widget" score ?score . }
This would return all the subjects, descriptions and scores from subjects with a has_description property containing widget and gizmo. Extending the basic pattern is better than having the match in a filter, since the match binds a variable.
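The semantics described above, a match that both filters and binds a score, can be sketched outside SPARQL. The following is an illustration only: the scoring rule (total occurrences of the search words) and the `ft_score` and `search` names are assumptions, not how bif:contains or any XQuery full text implementation ranks hits.

```python
import re


def ft_score(text, words):
    """Return a hit score if text contains every word, else None.

    The score here is simply the total number of occurrences of the search
    words; real text indexes use more elaborate ranking.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = [tokens.count(w.lower()) for w in words]
    if not all(counts):
        return None
    return sum(counts)


def search(triples, prop, words):
    """Return (subject, description, score) for each matching triple."""
    results = []
    for s, p, o in triples:
        if p != prop:
            continue
        score = ft_score(o, words)
        if score is not None:
            # The match produces a new binding (the score), which is why it
            # behaves like a pattern rather than a boolean filter.
            results.append((s, o, score))
    return results
```

For example, `search(triples, "has_description", ["gizmo", "widget"])` plays the role of the `ftcontains ... score ?score` pattern in the query above.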
The XQuery/XPATH groups have recently come up with a full text spec,
so I used their style of syntax above. We already have a full text
extension, as do some others, but for standardization, it is probably
most appropriate to take the XQuery work as a basis. The XQuery full
text spec is quite complex but I would expect most uses to get by with
a small subset, and the structure seems better thought out, at first
glance, than the more ad hoc implementations in diverse SQLs.
Again, declaring any text index to support the search, as well as its timeliness or transactionality, are best left to implementations.
- Federation - This is a tricky matter. ARQ has a SPARQL extension for
sending a nested set of triple patterns to a specific end point. The
DARQ project has something more, including a selectivity model for
SPARQL.
With federated SQL, life is simpler since after the views are
expanded, we have a query where each table is at a known server and
has more or less known statistics. Generally, execution plans where as
much work as possible is pushed to the remote servers are preferred
and modeling the latencies is not overly hard. With SPARQL, each
triple pattern could in principle come from any of the federated
servers. Associating a specific end point to a fragment of the query
just passes the problem to the user. It is my guess that this is the
best we can do without getting very elaborate, and possibly buggy, end
point content descriptions for routing federated queries.
Having said this, there remains the problem of join order. I
suggested that we enhance the protocol by allowing asking an end point
for the query cost for a given SPARQL query. Since they all must have
a cost model for optimization, this should not be an impossible
request. A time cost and estimated cardinality would be enough.
Making statistics available a la DARQ was also discussed. Being able
to declare cardinalities expected of a remote end point is probably
necessary anyway, since not all will implement the cost model
interface. For standardization, agreeing on what is a proper
description of content and cardinality and how fine grained this
must be will be so difficult that I would not wait for it. A cost model interface would nicely hide this within the end point itself.
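As a rough sketch of how a federating engine might use such answers: suppose each end point returns a time cost and an estimated cardinality for its fragment of the query. The `CostEstimate` record, its field names and the ordering rule below are hypothetical; no such protocol interface exists in the SPARQL spec, and a real optimizer would also weigh join selectivity and latency.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    """Hypothetical reply from an end point asked to cost a query fragment."""
    endpoint: str
    pattern: str
    time_ms: float
    cardinality: int


def order_by_cardinality(estimates):
    """Naive join order: smallest estimated result first, ties by time cost.

    This is only a heuristic to show where the estimates would be consumed.
    """
    return sorted(estimates, key=lambda e: (e.cardinality, e.time_ms))
```

Starting from the most selective fragment keeps the intermediate result that has to be shipped between servers small, which is the usual motivation for asking about cardinality at all.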
With Virtuoso, we do not have a federated SPARQL scheme but we could
have the ARQ-like service construct. We'd use our own
cost model with explicit declarations of cardinalities of the remote
data for guessing a join order. Still, this is a bit of work. We'll see.
For practicality, the service construct coupled with join order hints is the best short term bet. Making this pretty enough for standardization is not self-evident, as it requires end point description and/or cost model hooks for things to stay declarative.
- End point description - This question has been around for a while and
I have blogged about it earlier, but we are not really at a point where
there would be even rough consensus about an end point ontology. We
should probably do something on our own to demonstrate some
application of this, as we host lots of linked open data sets.
- SQL equivalence - There were many requests for aggregation, some for
subqueries and nesting, expressions in select, negation, existence
and so on. I would call these all SQL equivalence. One use case
was taking all the teams in the database and for all with over 5
members, add the big_team class and a property for member count.
With Virtuoso, we could write this as
construct { ?team a big_team . ?team member_count ?ct } from ... where {?team a team . { select ?team2 count (*) as ?ct where { ?m member_of ?team2 } . filter (?team = ?team2 and ?ct > 5) }}
We have pretty much all the SQL equivalence features, as we have been working for some time at translating the
TPC H workload into SPARQL.
The usefulness of these things is uncontested but standardization could be hard as there are subtle questions about variable scope and the like.
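For readers more at home with procedural code, the big_team use case above amounts to a group-by with a having clause whose result is turned back into triples. The sketch below spells this out over an in-memory list of triples; the function name and the plain string terms are assumptions for illustration and say nothing about how Virtuoso evaluates the query.

```python
from collections import Counter


def big_team_triples(triples, threshold=5):
    """Construct big_team and member_count triples for teams over threshold."""
    # Subjects that are declared to be teams.
    teams = {s for s, p, o in triples if p == "a" and o == "team"}
    # Count members per team (the aggregate subquery).
    counts = Counter(o for s, p, o in triples if p == "member_of" and o in teams)
    out = []
    for team in sorted(counts):
        ct = counts[team]
        if ct > threshold:  # the filter on the aggregate
            out.append((team, "a", "big_team"))
            out.append((team, "member_count", ct))
    return out
```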
- Inference - The SPARQL spec does not deal with transitivity or such
matters because it is assumed that these are handled by an
underlying inference layer. This is however most often not so.
There was interest in more fine grained control of
inference, for example declaring that just one property in a query
would be transitive or that subclasses should be taken into account
in only one triple pattern. As far as I am concerned, this is very
reasonable and we even offer extensions for this sort of thing in
Virtuoso's SPARQL. This however only makes sense if the inference
is done at query time and pattern by pattern. For instance, if
forward chaining is used, this no longer makes sense. Specifying
that some forward chaining ought to be done at query time is
impractical, as the operation can be very large and time consuming
and it is the dba's task to determine what should be stored and for
how long, how changes should be propagated and so on. All these are application dependent and standardizing will be difficult.
Support for RDF features like lists and bags would all fall under the functions an underlying inference layer should perform. These things are of special interest when querying OWL models, for example.
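The request for declaring just one property transitive within a query comes down to computing a closure over that property at query time and leaving every other property alone. A minimal sketch, assuming an in-memory list of triples and a hypothetical `reachable` helper (this is not Virtuoso's extension syntax or implementation):

```python
def reachable(triples, start, prop):
    """Return all nodes reachable from start via one or more prop edges."""
    # Index only the edges of the one property treated as transitive.
    edges = {}
    for s, p, o in triples:
        if p == prop:
            edges.setdefault(s, set()).add(o)
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:  # guards against cycles in the data
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

Nothing is materialized here, which is what makes the per-pattern control meaningful; with forward chaining the closure would already be stored and there would be nothing left to switch on or off in the query.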
- Path expressions - Path expressions were requested by a few people.
We have implemented some, as in ?product+?has_supplier+>s_name =
"Gizmos, Inc.". This means that one supplier of product has name
"Gizmos, Inc.". This is a nice shorthand but we run into problems if
we start supporting repetitive steps, optional steps and the like.
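Read as a sequence of fixed steps, a path expression like the one above just follows one property after another and tests the values at the end. The sketch below shows that reading over an in-memory list of triples; the helper names are hypothetical, and it deliberately leaves out repetitive and optional steps, which is where the difficulties mentioned above begin.

```python
def follow_path(triples, start, steps):
    """Follow each property in steps in turn, returning the final value set."""
    current = {start}
    for prop in steps:
        current = {o for s, p, o in triples if p == prop and s in current}
    return current


def path_matches(triples, start, steps, value):
    """True if at least one end of the path equals value."""
    return value in follow_path(triples, start, steps)
```

The existential reading ("one supplier of product has name ...") corresponds to the membership test in `path_matches`.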
In conclusion, update, full text and basic counting and grouping would
seem straightforward at this point. Nesting queries, value
subqueries, views and the like should not be too hard if an agreement
is reached on scope rules. Inference and federation will probably
need more experimentation but a lot can be had already with very
simple fine grained control of backward chaining, if such applies or
with explicit end point references and explicit join order. These are
practical but not pretty enough for committee consensus, would be my
guess. Anyway, it will be a few months before anything formal will
happen.