We have just added a geometry data type and corresponding R-tree index to Virtuoso. This follows the general scheme of SQL/MM, as is implemented by PostGIS and many others. We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins. We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes.
The geometry support is for both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed literal of the virtrdf:Geometry type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL/MM predicates and functions can be used with SPARQL, like this:
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?class
       COUNT (*)
WHERE { ?m geo:geometry ?geo .
        ?m a ?class .
        FILTER ( <bif:st_intersects>
                 ( ?geo,
                   <bif:st_point> (0, 52),
                   100
                 )
               )
      }
GROUP BY ?class
ORDER BY DESC 2
This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.
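To make the "within 100 km" test concrete, here is a minimal Python sketch of the same kind of filter done as a plain linear scan. It is not how Virtuoso evaluates st_intersects; the haversine formula, the Earth radius constant, the function names, and the sample coordinates are all assumptions made for illustration. Points are taken as (longitude, latitude), matching the argument order of the st_point call above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(points, center, radius_km):
    """Return the names of points whose distance from center is at most radius_km."""
    clon, clat = center
    return [name for name, lon, lat in points
            if haversine_km(lon, lat, clon, clat) <= radius_km]


# Hypothetical sample data: (name, longitude, latitude)
sample = [
    ("London", -0.1276, 51.5072),
    ("Cambridge", 0.1218, 52.2053),
    ("Paris", 2.3522, 48.8566),
]
nearby = within_km(sample, (0.0, 52.0), 100)
```

A scan like this touches every point, which is exactly what an R-tree index is meant to avoid: the index narrows the candidates to those whose bounding boxes overlap the search area before any exact distance is computed.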
For any data set with WGS 84 geo:long and geo:lat values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry property of the subject with the long/lat. This then enables fast spatial access to arbitrary location data in RDF.
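The effect of that conversion step can be sketched outside the database. The Python below is an illustration only, not the SQL function mentioned above: the function name, the in-memory triple tuples, and the use of WKT text for the point are assumptions. It also simplifies by keeping one long/lat pair per subject.

```python
from collections import defaultdict

GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def add_point_geometries(triples):
    """Return new (subject, geo:geometry, WKT point) triples for every subject
    that has both a geo:long and a geo:lat value."""
    coords = defaultdict(dict)
    for s, p, o in triples:
        # If a subject repeats a property, the last value seen wins (a simplification).
        if p == GEO + "long":
            coords[s]["long"] = float(o)
        elif p == GEO + "lat":
            coords[s]["lat"] = float(o)
    out = []
    for s, c in coords.items():
        if "long" in c and "lat" in c:
            # WKT puts x (longitude) before y (latitude).
            out.append((s, GEO + "geometry", f"POINT({c['long']} {c['lat']})"))
    return out


# Hypothetical input: one subject with a full pair, one with only a latitude.
triples = [
    ("http://example.org/Berlin", GEO + "long", "13.405"),
    ("http://example.org/Berlin", GEO + "lat", "52.52"),
    ("http://example.org/Nowhere", GEO + "lat", "10.0"),
]
geoms = add_point_geometries(triples)
```

Subjects with only half of a coordinate pair are skipped, since no point can be built for them.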
Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities. As these get adopted we will support them.
For scalability, we tried the implementation with OpenStreetMap's 350 million or so points. The geometry implementation partitions well over a cluster, similarly to a full text index: every server has its slice of the geometries, partitioned by the geometry object's key rather than by coordinate range. This way, the items are evenly spread even though the coordinate distribution is highly uneven.
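The difference between partitioning by key and partitioning by coordinate range shows up clearly on skewed data. The following Python sketch is a toy model, not Virtuoso's partitioning code: the MD5-based hash, the equal-width longitude bands, the partition count of 8, and the synthetic point distribution are all assumptions chosen for illustration.

```python
import hashlib
import random
from collections import Counter

N_PARTITIONS = 8  # hypothetical; chosen to mirror an 8-process cluster


def partition_by_key(key):
    """Assign a geometry to a partition by hashing its key (MD5 as a stand-in hash)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_PARTITIONS


def partition_by_longitude(lon):
    """Assign a geometry to one of N_PARTITIONS equal-width longitude bands."""
    band = int((lon + 180.0) / 360.0 * N_PARTITIONS)
    return min(band, N_PARTITIONS - 1)  # lon == 180.0 falls into the last band


rng = random.Random(42)
points = []
for key in range(100_000):
    # Synthetic skew: 90% of points cluster around longitude 10, the rest are uniform.
    if rng.random() < 0.9:
        lon = rng.gauss(10.0, 5.0)
    else:
        lon = rng.uniform(-180.0, 180.0)
    lon = max(-180.0, min(180.0, lon))
    points.append((key, lon))

by_key = Counter(partition_by_key(k) for k, _ in points)
by_range = Counter(partition_by_longitude(lon) for _, lon in points)

print("by key:  ", sorted(by_key.values()))
print("by range:", sorted(by_range.values()))
```

In this model the key-hashed partitions come out close to equal in size, while the range-based split puts most of the points into a single band. The trade-off is that a spatial lookup under key partitioning has to be sent to every partition, since nearby points are not co-located.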
We can do spatial joins like this:
SELECT ?s
       ( <sql:num_or_null> (?p) )
       COUNT (*)
WHERE { ?s <http://dbpedia.org/ontology/populationTotal> ?p .
        FILTER
          ( <sql:num_or_null> (?p) > 1000000 ) .
        ?s geo:geometry ?geo .
        FILTER
          ( <bif:st_intersects> ( ?pt, ?geo, 5 ) ) .
        ?xx geo:geometry ?pt
      }
GROUP BY ?s
         ( <sql:num_or_null> (?p) )
ORDER BY DESC 3
LIMIT 20
This takes the DBpedia subjects that have a population over 1 million and a geometry. For each such subject, we then count all the geometries within 5 km of its point location. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result:
http://dbpedia.org/resource/Munich 1356594 117280
http://dbpedia.org/resource/London 7355400 81486
http://dbpedia.org/resource/Davao_City 1363337 58640
http://dbpedia.org/resource/Belo_Horizonte 2412937 58640
http://dbpedia.org/resource/Chengde 3610000 58640
http://dbpedia.org/resource/Hamburg 1769117 51664
http://dbpedia.org/resource/San_Diego%2C_California 1266731 47685
http://dbpedia.org/resource/Bursa 1562828 47685
http://dbpedia.org/resource/Port-au-Prince 1082800 47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan 1194156 45636
http://dbpedia.org/resource/Sana%27a 1747627 40923
http://dbpedia.org/resource/Milan 1303437 40923
http://dbpedia.org/resource/Campinas 1059420 40923
http://dbpedia.org/resource/Hohhot 2580000 40923
http://dbpedia.org/resource/Brussels 1031215 40923
http://dbpedia.org/resource/Bogra_District 2988567 40923
http://dbpedia.org/resource/Cort%C3%A9s_Department 1202510 40923
http://dbpedia.org/resource/Berlin 3416300 35668
http://dbpedia.org/resource/New_York_City 8274527 30810
http://dbpedia.org/resource/Los_Angeles%2C_California 3849378 25614
20 Rows. -- 1733 msec.
Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s 664% cpu 2% read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs
This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm cache. That is fair enough for a first crack, and it can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous.
We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online we will show more demo queries.
For more about SQL/MM, you can look to a couple of PDFs: