Background

I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language.

The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead.

It is now time to apply this principle to graph traversal.

The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.

The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step.

This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc.

Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand ("backward chaining") and mapping of relational data into RDF without copying.

The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows definition of recursive views. The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node's immediate children.

The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL's non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.

Take a question like "list the parts of products of category C which have materials that are classified as toxic." Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where "toxic" has a multilevel substructure.

Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth — regular cost based optimization.

Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.

In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.

SQL and Transitivity

We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., SELECT in another SELECT's FROM clause, with a TRANSITIVE clause.

Consider the data:

CREATE TABLE "knows" 
   ("p1" INT, 
    "p2" INT, 
    PRIMARY KEY ("p1", "p2")
   );
ALTER INDEX "knows" 
   ON "knows" 
   PARTITION ("p1" INT);
CREATE INDEX "knows2" 
   ON "knows" ("p2", "p1") 
   PARTITION ("p2" INT);

 

We represent a social network with the many-to-many relation "knows". The persons are identified by integers.

INSERT INTO "knows" VALUES (1, 2);
INSERT INTO "knows" VALUES (1, 3);
INSERT INTO "knows" VALUES (2, 4);
 
SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2" 
         FROM "knows"
        ) "k" 
   WHERE "k"."p1" = 1;

We obtain the result:

p1 p2
1 3
1 2
1 4

The operation is reversible:

SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2" 
         FROM "knows"
        ) "k" 
   WHERE "k"."p2" = 4;

 
p1 p2
2 4
1 4

Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1.

To see what would happen if x knowing y also meant y knowing x, one could write:

SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2" 
	    FROM (SELECT 
                  "p1", 
                  "p2" 
               FROM "knows" 
               UNION ALL 
                  SELECT 
                     "p2", 
                     "p1" 
                  FROM "knows"
              ) "k2"
        ) "k" 
   WHERE "k"."p2" = 4;
 
p1 p2
2 4
1 4
3 4

Now, since we know that 1 and 4 are related, we can ask how they are related.

SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2", 
            T_STEP (1) AS "via", 
            T_STEP ('step_no') AS "step", 
            T_STEP ('path_id') AS "path" 
         FROM "knows"
        ) "k" 
   WHERE "p1" = 1 
      AND "p2" = 4;
 
p1 p2 via step path
1 4 1 0 0
1 4 2 1 0
1 4 4 2 0

The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other.

For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as:

SELECT p2, 
      dist, 
      (SELECT 
          COUNT (*) 
          FROM "knows" "c" 
          WHERE "c"."p1" = "k"."p2"
      ) 
   FROM (SELECT 
            TRANSITIVE t_in (1) t_out (2) t_distinct "p1", 
            "p2", 
            t_step ('step_no') AS "dist"
         FROM "knows"
        ) "k" 
   WHERE "p1" = 1 
   ORDER BY "dist", 3 DESC;
 
p2 dist aggregate
2 1 1
3 1 0
4 2 0

How?

The queries shown above work on Virtuoso v6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.

Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.

Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS.

Next

In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.