## Background

I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language.

The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead.

It is now time to apply this principle to graph traversal.

The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.

The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If *a* was part of *b*, and *b* part of *c*, the implied fact that *a* is part of *c* would be inserted explicitly into the database as a pre-query step.

This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc.

Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand ("backward chaining") and mapping of relational data into RDF without copying.

The SQL world has taken steps towards dealing with recursion with the `WITH - UNION`

construct which allows definition of recursive views. The idea there is to define, for example, a tree walk as a `UNION`

of the data of the starting node plus the recursive walk of the starting node's immediate children.

The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving `JOIN`

s between such recursive views. This model of recursion seems to lose SQL's non-procedural nature. One can no longer easily rearrange `JOIN`

s based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.

Take a question like "list the parts of products of category *C* which have materials that are classified as toxic." Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where "toxic" has a multilevel substructure.

Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth — regular cost based optimization.

Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.

In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, *ipso facto*, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.

## SQL and Transitivity

We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., `SELECT`

in another `SELECT`

's `FROM`

clause, with a `TRANSITIVE`

clause.

Consider the data:

`CREATE TABLE "knows" ("p1" INT, "p2" INT, PRIMARY KEY ("p1", "p2") ); ALTER INDEX "knows" ON "knows" PARTITION ("p1" INT); CREATE INDEX "knows2" ON "knows" ("p2", "p1") PARTITION ("p2" INT);`

We represent a social network with the many-to-many relation "knows". The persons are identified by integers.

`INSERT INTO "knows" VALUES (1, 2); INSERT INTO "knows" VALUES (1, 3); INSERT INTO "knows" VALUES (2, 4);`

`SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2" FROM "knows" ) "k" WHERE "k"."p1" = 1;`

We obtain the result:

p1 p2 1 3 1 2 1 4

The operation is reversible:

`SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2" FROM "knows" ) "k" WHERE "k"."p2" = 4;`

p1 p2 2 4 1 4

Since now we give *p2*, we traverse from *p2* towards *p1*. The result set states that 4 is known by 2 and 2 is known by 1.

To see what would happen if *x* knowing *y* also meant *y* knowing *x*, one could write:

`SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2" FROM (SELECT "p1", "p2" FROM "knows" UNION ALL SELECT "p2", "p1" FROM "knows" ) "k2" ) "k" WHERE "k"."p2" = 4;`

p1 p2 2 4 1 4 3 4

Now, since we know that 1 and 4 are related, we can ask how they are related.

`SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT "p1", "p2", T_STEP (1) AS "via", T_STEP ('step_no') AS "step", T_STEP ('path_id') AS "path" FROM "knows" ) "k" WHERE "p1" = 1 AND "p2" = 4;`

p1 p2 via step path 1 4 1 0 0 1 4 2 1 0 1 4 4 2 0

The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., *p1*, has number 0. Since there can be multiple solutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other.

For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as:

`SELECT p2, dist, (SELECT COUNT (*) FROM "knows" "c" WHERE "c"."p1" = "k"."p2" ) FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct "p1", "p2", t_step ('step_no') AS "dist" FROM "knows" ) "k" WHERE "p1" = 1 ORDER BY "dist", 3 DESC;`

p2 dist aggregate 2 1 1 3 1 0 4 2 0

## How?

The queries shown above work on Virtuoso v6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.

Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.

Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS.

## Next

In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.

## Comments

algorithm is internally latency tolerant."

Do you have some deep magic voo-doo that works around the problem of having a straight chain? In the worst case, every single step of your search will cross nodes. Every one. That is in no way, shape, or form "latency tolerant."

In the realm of parallel graph manipulation and searching, you need to be very, very precise in what you declare... Are your results for some general case, over some set of examples, or a technical average?

You certainly are not talking about worst-case performance. Some graphs contain no parallelism at all. Some pointer jumping or triangulation can add parallelism to otherwise sequential graphs but at an explosive storage cost. There are some nifty techniques for pulling as much parallelism as possible out of a graph (splitting by the DFS tree, forgot the author, but it's referenced from Reif's proof that DFS is sequential), but they are quite painful if your data is dynamic.