RDF Geography With Virtuoso

We have just added a geometry data type and a corresponding R-tree index to Virtuoso. This follows the general scheme of SQL/MM, as implemented by PostGIS and many others. The engine-side work is in place, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins. We have not, however, yet implemented all the different geometry types or the library functions for them, such as the shortest distance between two arbitrary shapes.

The geometry support covers both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed literal of the virtrdf:Geometry type, it is indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL/MM predicates and functions can be used with SPARQL, like this:

  PREFIX  geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>  
  SELECT  ?class
          COUNT (*) 
   WHERE  { ?m  geo:geometry  ?geo    . 
            ?m  a             ?class  . 
                FILTER ( <bif:st_intersects> 
                          ( ?geo, 
                            <bif:st_point> (0, 52), 
                            100
                          )
                       )
          } 
GROUP BY  ?class 
ORDER BY  DESC 2

This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.
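To make the meaning of the filter concrete, here is a minimal Python sketch of the same question asked by brute force: keep the items whose great-circle distance from the point is at most 100 km, then count them per class. It is only an illustration of the semantics, not how Virtuoso evaluates the query. The haversine formula, the Earth radius constant, the function names, and the sample data are all assumptions of the sketch.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_classes_near(items, center, radius_km):
    """items: iterable of (class_name, lon, lat).

    Returns (class_name, count) pairs for items within radius_km of center,
    ordered by descending count, like GROUP BY ?class ORDER BY DESC 2.
    """
    lon0, lat0 = center
    counts = Counter(
        cls for cls, lon, lat in items
        if haversine_km(lon, lat, lon0, lat0) <= radius_km
    )
    return counts.most_common()


# Hypothetical sample data: (class, longitude, latitude)
items = [
    ("Pub", -0.1276, 51.5072),     # central London, well within 100 km of (0, 52)
    ("Pub", 0.1218, 52.2053),      # Cambridge, also within range
    ("Museum", -0.1269, 51.5194),  # London
    ("Pub", 2.3522, 48.8566),      # Paris, far outside the radius
]
result = count_classes_near(items, (0.0, 52.0), 100.0)
```

The brute-force scan touches every item; the point of the R-tree index is that only the geometries near the search point need to be examined.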

For any data set with WGS 84 geo:long and geo:lat values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry property of the subject that carries the long/lat values. This then enables fast spatial access to arbitrary location data in RDF.
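The transformation itself is simple, and the following Python sketch shows its shape outside the database: pair up geo:long and geo:lat per subject and emit one geo:geometry triple with a point literal. This is not the SQL function mentioned above. The WKT lexical form of the literal, the namespace IRI used for virtrdf:, and the helper name point_triples are assumptions made for illustration.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
# Assumed namespace IRI for the virtrdf: prefix
VIRTRDF = "http://www.openlinksw.com/schemas/virtrdf#"


def point_triples(triples):
    """Build geo:geometry N-Triples lines from geo:long / geo:lat values.

    triples: iterable of (subject_iri, predicate_iri, object_value).
    Subjects missing either coordinate are skipped.
    """
    longs, lats = {}, {}
    for s, p, o in triples:
        if p == GEO + "long":
            longs[s] = float(o)
        elif p == GEO + "lat":
            lats[s] = float(o)

    lines = []
    for s in sorted(longs.keys() & lats.keys()):
        # Longitude first, matching the (0, 52) argument order used for the point near London.
        wkt = f"POINT({longs[s]!r} {lats[s]!r})"
        lines.append(f'<{s}> <{GEO}geometry> "{wkt}"^^<{VIRTRDF}Geometry> .')
    return lines


# Hypothetical input triples
sample = [
    ("http://example.org/a", GEO + "long", "0"),
    ("http://example.org/a", GEO + "lat", "52"),
    ("http://example.org/b", GEO + "lat", "48.85"),  # no longitude: skipped
]
lines = point_triples(sample)
```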

Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts toward vocabularies for more complex entities. As these get adopted, we will support them.

To test scalability, we tried the implementation with OpenStreetMap's 350 million or so points. The geometry implementation partitions well over a cluster, much like a full-text index: every server holds its own slice of the geometries, partitioned by the geometry object's key rather than by coordinate range. This way, the items are spread evenly even though the coordinate distribution is highly uneven.
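The effect of partitioning by key instead of by coordinates can be seen in a small Python simulation. This is a sketch under stated assumptions, not Virtuoso's partitioning code: the hash function (SHA-256), the integer keys, the equal-width longitude bands, and the synthetic skewed data are all hypothetical.

```python
import hashlib
import random


def partition_by_key(key: int, n_partitions: int) -> int:
    """Hash partitioning on a (hypothetical) integer geometry key."""
    digest = hashlib.sha256(key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


def partition_by_longitude(lon: float, n_partitions: int) -> int:
    """Range partitioning into equal-width longitude bands over [-180, 180)."""
    band = int((lon + 180.0) / 360.0 * n_partitions)
    return min(band, n_partitions - 1)


def slice_sizes(assignments, n_partitions):
    """Count how many items land in each partition."""
    sizes = [0] * n_partitions
    for p in assignments:
        sizes[p] += 1
    return sizes


rng = random.Random(42)
N_PARTITIONS = 8

# Synthetic, deliberately skewed data: 90% of the points cluster around longitude 10.
points = []
for key in range(80_000):
    if rng.random() < 0.9:
        lon = rng.gauss(10.0, 5.0)
    else:
        lon = rng.uniform(-180.0, 180.0)
    lon = max(-180.0, min(179.999, lon))
    points.append((key, lon))

by_key = slice_sizes((partition_by_key(k, N_PARTITIONS) for k, _ in points), N_PARTITIONS)
by_range = slice_sizes((partition_by_longitude(lon, N_PARTITIONS) for _, lon in points), N_PARTITIONS)

print("by key:  ", by_key)
print("by range:", by_range)
```

With key-based partitioning the slices come out nearly equal, while the longitude bands put most of the data on a single partition. The trade-off is that a spatial lookup has to consult every slice, as with a partitioned full-text index.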

We can do spatial joins like this:

   SELECT  ?s 
           ( <sql:num_or_null> (?p) )  
           COUNT (*) 
    WHERE  { ?s   <http://dbpedia.org/ontology/populationTotal>  ?p    . 
             FILTER 
               ( <sql:num_or_null> (?p) > 1000000 )                      . 
             ?s   geo:geometry                                   ?geo  .
             FILTER 
               ( <bif:st_intersects> ( ?pt, ?geo, 5 ) )                  . 
             ?xx  geo:geometry                                   ?pt 
           } 
 GROUP BY  ?s 
           ( <sql:num_or_null> (?p) )
 ORDER BY  DESC 3 
    LIMIT  20

This takes the DBpedia subjects that have a population over 1 million and a geometry. For each such subject, we then count all the geometries within 5 km of its point location. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result:

http://dbpedia.org/resource/Munich                        1356594    117280
http://dbpedia.org/resource/London                        7355400     81486
http://dbpedia.org/resource/Davao_City                    1363337     58640
http://dbpedia.org/resource/Belo_Horizonte                2412937     58640
http://dbpedia.org/resource/Chengde                       3610000     58640
http://dbpedia.org/resource/Hamburg                       1769117     51664
http://dbpedia.org/resource/San_Diego%2C_California       1266731     47685
http://dbpedia.org/resource/Bursa                         1562828     47685
http://dbpedia.org/resource/Port-au-Prince                1082800     47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan    1194156     45636
http://dbpedia.org/resource/Sana%27a                      1747627     40923
http://dbpedia.org/resource/Milan                         1303437     40923
http://dbpedia.org/resource/Campinas                      1059420     40923
http://dbpedia.org/resource/Hohhot                        2580000     40923
http://dbpedia.org/resource/Brussels                      1031215     40923
http://dbpedia.org/resource/Bogra_District                2988567     40923
http://dbpedia.org/resource/Cort%C3%A9s_Department        1202510     40923
http://dbpedia.org/resource/Berlin                        3416300     35668
http://dbpedia.org/resource/New_York_City                 8274527     30810
http://dbpedia.org/resource/Los_Angeles%2C_California     3849378     25614

20 Rows. -- 1733 msec.

Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s  664% cpu 2%  read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs

This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with a warm cache. That is fair enough for a first crack, though it can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous.

We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online, we will show more demo queries.

For more about SQL/MM, you can look to a couple of PDFs:

SQL/MM Spatial: The Standard to Manage Spatial Data in Relational Database Systems by Knut Stolze
SQL Multimedia and Application Packages (SQL/MM) by Jim Melton and Andrew Eisenberg

Orri Erling's Weblog
