Linked Geospatial Data 2014 Workshop, Part 1: Web Services or SPARQL Modeling?

The W3C (World Wide Web Consortium) and OGC (Open Geospatial Consortium) organized the Linked Geospatial Data 2014 workshop in London this week. The GeoKnow project was represented by Claus Stadler of Universität Leipzig, and by Hugh Williams and me (Orri Erling) from OpenLink Software. The Open Knowledge Foundation (OKFN) also held an Open Data Meetup on the evening of the first day of the workshop.

Reporting on each talk and the many highly diverse topics addressed is beyond the scope of this article; for that, see the program and the slides, which will be online. Instead, I will talk about questions that seemed to me to be in the air, and about some conversations I had with the relevant people.

The trend in events like this is towards shorter and shorter talks and more and more interaction. In this workshop, talks were given in groups of three, with all questions taken at the end and all the presenters on stage. This is not a bad idea, since we get a panel-like effect where many presenters can address the same question. If the subject matter allows, a panel is my preferred format.

Web services or SPARQL? Is GeoSPARQL good? Is it about Linked Data or about ontologies?

Geospatial data tends to be exposed via web services, e.g., WFS (Web Feature Service). This allows item retrieval on a lookup basis and some predefined filtering, transformation, and content negotiation. Capabilities vary; OGC now has WFS 2.0, and there are open source implementations that do a fair job of providing the functionality.
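To make the lookup-plus-predefined-filtering style concrete, here is a minimal sketch of how a WFS 2.0 GetFeature request can be assembled as a key-value-pair URL. The endpoint `https://example.org/wfs` and the feature type `ex:roads` are made up for illustration; the parameter names (`service`, `version`, `request`, `typeNames`, `bbox`, `count`) are the ones I understand WFS 2.0 to use, and the axis order of the bounding box depends on the coordinate reference system the server assumes.

```python
from urllib.parse import urlencode


def wfs_getfeature_url(base_url, type_name, bbox=None, count=None):
    """Build a WFS 2.0 GetFeature request URL (key-value-pair encoding)."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
    }
    if bbox is not None:
        # Four numbers; which one is latitude depends on the server's CRS.
        params["bbox"] = ",".join(str(v) for v in bbox)
    if count is not None:
        # Upper limit on the number of features returned.
        params["count"] = str(count)
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and feature type, for illustration only.
url = wfs_getfeature_url(
    "https://example.org/wfs",
    "ex:roads",
    bbox=(-0.5, 51.3, 0.3, 51.7),
    count=10,
)
print(url)
```

Everything the client can ask for is fixed by the parameter list, which is what makes the workload predictable for the publisher.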

Of course, a real query language is much more expressive, but a service API is more scalable, as people say. What they mean is that an API is more predictable. For pretty much any complex data task, a query language is near-infinitely more efficient than going back and forth, often over a wide area network, via an API. So, as Andreas Harth put it: for data publishers, make an API; an open SPARQL endpoint is too "brave" (Andreas' word, meaning foolhardy). When you analyze, he continued, you load the data into an endpoint, but you use your own. Any quality-of-service terms must be formulated with respect to a fixed workload; they are not meaningful with ad hoc queries in an expressive language. Things like anytime semantics (return whatever is found within a time limit) are only good for a first interactive look, not for applications.
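The round-trip argument can be illustrated with a toy model. The sketch below is not a benchmark: `LookupApi` and `QueryEndpoint` are hypothetical in-memory stand-ins, the data is synthetic, and the 50 ms latency per round trip is an assumed figure. It only counts how many requests each style needs to answer the same question.

```python
class LookupApi:
    """Toy item-lookup service: one round trip per call."""

    def __init__(self, items):
        self._items = items
        self.round_trips = 0

    def list_ids(self):
        self.round_trips += 1
        return list(self._items)

    def get(self, item_id):
        self.round_trips += 1
        return self._items[item_id]


class QueryEndpoint:
    """Toy query service: the filter is evaluated next to the data."""

    def __init__(self, items):
        self._items = items
        self.round_trips = 0

    def query(self, predicate):
        self.round_trips += 1
        return [i for i, item in self._items.items() if predicate(item)]


def wanted(item):
    """The question being asked: pubs in the western half of the data."""
    return item["kind"] == "pub" and item["x"] < 500


# Synthetic data: 1000 features, every tenth one is a pub.
items = {i: {"kind": "pub" if i % 10 == 0 else "other", "x": i}
         for i in range(1000)}

# API style: fetch every item, filter on the client.
api = LookupApi(items)
via_api = [i for i in api.list_ids() if wanted(api.get(i))]

# Query style: ship the condition, get back only the answer.
endpoint = QueryEndpoint(items)
via_query = endpoint.query(wanted)

LATENCY_S = 0.05  # assumed wide-area latency per round trip
print("API round trips:", api.round_trips, "~", api.round_trips * LATENCY_S, "s")
print("Query round trips:", endpoint.round_trips, "~", endpoint.round_trips * LATENCY_S, "s")
```

The flip side is visible in the same sketch: the publisher of `LookupApi` knows exactly what each call costs, while the publisher of `QueryEndpoint` has to evaluate whatever predicate arrives.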

Should the application go to the data or the reverse? Some data is big, and moving it cannot be taken for granted. A culture of datasets being hosted on a cloud may be forming. Of course, some linked data, such as DBpedia, has long been available as Amazon images. Recently, SindiceTech has made a similar packaging of Freebase. The data of interest here is larger, and its target audience is more specific, on the e-science side.

How should geometries be modeled? I have met GeoSPARQL, and the SQL/MM on which it is based, with a sense of relief, as these are reasonable things that can be efficiently implemented. There are proposals where points have URIs, linestrings are ordered sets of points, and collections are actual trees with RDF subjects as nodes. As a standard, such a thing is beyond horrible, as it hits all the RDF penalties and overheads full force, and promises easily 10x worse space consumption and 100x worse run times compared to the sweetly reasonable GeoSPARQL.

One presenter said that cases of actually hanging attributes off points of complex geometries had been heard of but were, in his words, anecdotal. He posed a question to the audience about use cases where points in fact needed separately addressable identities. Several cases did emerge, involving, for example, different measurement certainties for different points on a trajectory trace obtained by radar. Applications that need data of this sort will perforce be very domain specific. OpenStreetMap (OSM) itself is a bit like this, but there the points that have individual identity also have predominantly non-geometry attributes and stand for actually distinct entities. OSM being a practical project, these are then again collapsed into linestrings for cases where this is more efficient. The OGC data types themselves have up to 4 dimensions, of which the 4th could be used as an identifier of a point in the event this really were needed. If so, this would likely be empty for most points and would compress away if the data representation were done right.
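To get a feel for the overhead of the point-per-URI style, the sketch below generates triples for the same linestring in two ways: as a single WKT literal in the GeoSPARQL manner (`geo:asWKT`, `geo:wktLiteral`), and as an RDF list of individually identified points. The `ex:` terms and the exact shape of the second model are my own assumptions for illustration, not any particular proposal, and a triple count is only a rough proxy for the space and run time penalties mentioned above.

```python
def wkt_model(geom_uri, coords):
    """One geometry, one literal: the GeoSPARQL style."""
    wkt = "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in coords) + ")"
    return [
        (geom_uri, "rdf:type", "sf:LineString"),
        (geom_uri, "geo:asWKT", f'"{wkt}"^^geo:wktLiteral'),
    ]


def point_per_node_model(geom_uri, coords):
    """Every point is an RDF subject, ordered via an rdf:List (assumed shape)."""
    triples = [(geom_uri, "rdf:type", "ex:LineString")]
    prev_cell = None
    for i, (x, y) in enumerate(coords):
        point = f"{geom_uri}/point/{i}"
        cell = f"_:cell{i}"
        if prev_cell is None:
            # Hypothetical property linking the geometry to its point list.
            triples.append((geom_uri, "ex:points", cell))
        else:
            triples.append((prev_cell, "rdf:rest", cell))
        triples.append((cell, "rdf:first", point))
        triples.append((point, "ex:x", str(x)))
        triples.append((point, "ex:y", str(y)))
        prev_cell = cell
    if prev_cell is not None:
        triples.append((prev_cell, "rdf:rest", "rdf:nil"))
    return triples


# A synthetic 1000-point linestring.
coords = [(i * 0.001, 51.5 + i * 0.0005) for i in range(1000)]

literal_triples = wkt_model("ex:road42/geom", coords)
node_triples = point_per_node_model("ex:road42/geom", coords)

print("WKT literal model:", len(literal_triples), "triples")
print("Point-per-node model:", len(node_triples), "triples")
```

Under these assumptions the second model needs four triples per point plus two for the geometry, and reassembling the shape means walking the list one cell at a time, whereas the literal can be parsed and spatially indexed as a unit.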

For data publishing, Andreas proposed giving OGC geometries URIs; for example, the borders of a country can be modeled more or less precisely, and the large polygon may have different versions and provenances. This is reasonable enough, as long as the geometries are big. For applications, one will then collapse the 1:n relationship between an entity and its geometries into a 1:1. In the end, when you make an application, even an RDF one, you do not just throw all the data in a bucket and write queries against that. Some alignment and transformation is generally involved.
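The collapse from 1:n to 1:1 might look something like the following sketch. The `GeometryVersion` record, its fields, the sample URIs, and the selection rule (prefer a trusted source, then the most detailed polygon) are all assumptions made for illustration; a real application would pick whatever rule fits its needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeometryVersion:
    uri: str           # URI of this particular geometry
    entity: str        # the entity it describes, e.g. a country
    source: str        # provenance label
    vertex_count: int  # rough proxy for precision
    wkt: str


def collapse_to_one(geometries, preferred_sources):
    """Keep one geometry per entity: best source first, then most vertices."""
    rank = {s: i for i, s in enumerate(preferred_sources)}
    best = {}
    for g in geometries:
        key = (rank.get(g.source, len(rank)), -g.vertex_count)
        if g.entity not in best or key < best[g.entity][0]:
            best[g.entity] = (key, g)
    return {entity: g for entity, (_, g) in best.items()}


# Hypothetical published data: several geometries per entity.
published = [
    GeometryVersion("ex:geom/fr-1", "ex:France", "crowd", 300, "POLYGON ((...))"),
    GeometryVersion("ex:geom/fr-2", "ex:France", "survey", 12000, "POLYGON ((...))"),
    GeometryVersion("ex:geom/mt-1", "ex:Malta", "crowd", 150, "POLYGON ((...))"),
    GeometryVersion("ex:geom/mt-2", "ex:Malta", "crowd", 900, "POLYGON ((...))"),
]

one_per_entity = collapse_to_one(published, preferred_sources=["survey", "crowd"])
for entity, g in one_per_entity.items():
    print(entity, "->", g.uri)
```

The published data keeps every version addressable by URI; the application works on the collapsed table.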