Virtuoso Cluster Stage 1

I recall a quote from a stock car racing movie.

"What is the necessary prerequisite for winning a race?" asked the racing team boss.

"Being the fastest," answered the hotshot driver, after yet another wrecked engine.

"No. It is finishing the race."

In the interest of finishing, we'll now leave optimizing cluster traffic and scheduling and move to completing functionality. Our next stop is TPC-D, then TPC-C, which adds the requirement of handling distributed deadlocks, and after that RDF-specific optimizations.

This will be Virtuoso 6 with the first stage of clustering support. It uses fixed partitions, so it behaves just like a single database except that it runs on multiple machines. The stage after this is Virtuoso Cloud, the database with all the space-filling properties of foam, expanding and contracting to keep an even data density as load and resource availability change.

Right now, we have a pretty good idea of the final form of evaluating loop joins in a cluster, which after all is the main function of the thing. It makes sense to tune this to a point before going further. You want the pipes and pumps and turbines to have known properties and fittings before building a power plant.

To test this, we took a table of a million short rows and made one copy partitioned over 4 databases and one copy with all rows in a single database. We ran all the instances on one 4-core Xeon box, using Unix sockets for communication.

We joined the table to itself, as in SELECT COUNT (*) FROM ct a, ct b WHERE b.row_no = a.row_no + 3. The + 3 guarantees that the joined rows are never on the same partition.
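Why the + 3 always crosses partitions can be sketched with a toy partitioning function. This assumes modulo partitioning on row_no over 4 partitions, which is an illustrative assumption, not necessarily Virtuoso's actual hash:

```python
# Toy illustration (assumed modulo partitioning, not Virtuoso's actual scheme):
# with 4 partitions keyed on row_no, row r and row r + 3 never share a partition,
# because 3 is not a multiple of 4, so (r + 3) % 4 != r % 4 for every r.
PARTITIONS = 4

def partition_of(row_no):
    # Hypothetical partitioning function: row_no modulo the partition count.
    return row_no % PARTITIONS

# Every join step in the b.row_no = a.row_no + 3 join crosses partitions.
assert all(partition_of(r) != partition_of(r + 3) for r in range(1_000_000))
```

Under this scheme every lookup for b is a remote operation, which makes the query a worst case for cluster communication overhead and thus a fair stress test.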

With the cluster, the single operation takes 3s; with a single process, it takes 4s. The overall CPU time for the cluster is about 30% higher, some of which is inevitable since it must combine results, serialize them, and so forth. Some real time is gained by running multiple iterations of the inner loop (getting the row for b) in parallel. The cluster case can be further optimized, maybe to 2x better, but this can wait a little.

Then we make a stream of 10 such queries. The stream takes 14s with the cluster and 22s with the single process. Running 4 streams in parallel takes 39s with the cluster and 36s with the single process. With 16 streams in parallel, the cluster takes 2m51s and the single process 3m21s.
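Restating the measurements above as ratios makes the trend easier to see; this is just arithmetic on the reported wall-clock times (2m51s = 171s, 3m21s = 201s):

```python
# Reported wall-clock times in seconds, (cluster, single process).
runs = {
    "1 stream of 10 queries": (14, 22),
    "4 parallel streams":     (39, 36),
    "16 parallel streams":    (171, 201),
}

# Ratio above 1.0 means the cluster is faster.
for name, (cluster, single) in runs.items():
    print(f"{name}: single/cluster = {single / cluster:.2f}x")
```

The cluster wins clearly with one stream, loses slightly at 4 streams when both configurations saturate the 4 cores, and pulls ahead again at 16 streams.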

The conclusion is that clustering overhead is not significant in a CPU-bound situation. Note that all the runs were at 4 cores at 98-100%, except for the first, single-client run, which had one process at 98% and 3 at 32%.

The SMP single process loses by having more contention for mutexes serializing index access. Each wait carries an entirely ridiculous penalty of up to 6µs or so, as discussed earlier on this blog. The cluster wins by less contention due to distributed data and loses due to having to process messages and remember larger intermediate results. These balance out, or close enough.

For the case with a single client, we can cut down on the coordination overhead by simply optimizing the code some more. This is quite possible and would get us one process at 100% and 3 at 50%.

The numbers are only relevant as ballpark figures and the percentages will vary between different queries. The point is to prove that we actually win and do not jump from the frying pan into the fire by splitting queries across processes. As a point of comparison, running the query clustered just as one would run it locally took 53s.

We will later look at the effects of different networks, as we get to revisit the theme with some real benchmarks.