I wrote the basics of the Virtuoso clustering support over the past three weeks. It can now manage connections, decide where data goes, perform two-phase commits, and insert and select data from tables partitioned over multiple Virtuoso instances. It works well enough to be measured, and I will blog more about those measurements over the next two weeks.
In the following, I will give a preview of the features that will be in the Virtuoso clustering support when it is released in the fall of this year (2007).
Data Partitioning
A Virtuoso database consists of indices only, so the row of a table is stored together with its primary key. Blobs are stored on separate pages when they do not fit inline within the row. With clustering, partitioning can be specified index by index. Partitioning means that the values of specific columns determine where the containing index entry will be stored. Virtuoso partitions by hash and allows specifying which parts of the partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string. This way, key compression gains are not lost by scattering consecutive values over different partitions.
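To make the partitioning function concrete, here is a small Python sketch of deriving a partition value from bits 14-6 of an integer or the first 5 characters of a string; this is only an illustration of the idea, not Virtuoso code.

    def partition_value(key):
        """Illustration only: the value that gets hashed for partitioning."""
        if isinstance(key, int):
            # Bits 14-6: drop the low 6 bits, keep the next 9.
            return (key >> 6) & 0x1FF
        # For strings, hash only the first 5 characters.
        return key[:5]

    # 1024..1087 differ only in their low 6 bits, so they share one partition
    # value and stay together, preserving the gains of key compression.
    assert len({partition_value(k) for k in range(1024, 1088)}) == 1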
Once the partitioning is specified, we declare which set of cluster nodes stores the index. Not every index has to be split evenly across all nodes, nor do all nodes have to hold equal slices of a partitioned index, which accommodates differences in capacity between cluster nodes.
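As an illustration of the placement (again Python pseudocode, not actual configuration syntax), a node that should hold a larger share of an index simply owns more slices of that index's hash space:

    # Hypothetical layout for one index: node "B" has twice the capacity of
    # "A" and "C", so it owns two of the four slices.
    SLICE_OWNERS = ["A", "B", "B", "C"]

    def node_for(partition_value):
        # A real system would use a hash that is stable across processes.
        return SLICE_OWNERS[hash(partition_value) % len(SLICE_OWNERS)]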
Each Virtuoso instance can manage up to 32TB of data. A cluster has no definite size limit.
Load Balancing and Fault Tolerance
When data is partitioned, an operation on the data goes where the data is. This provides a certain natural parallelism, which we will discuss further below.
Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load. Some data, such as the database schema, is replicated on all nodes. When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition. In that case, updates go to all of these nodes and reads go to a randomly picked node from the group.
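The read/write split within such a group can be pictured like this; the sketch is illustrative Python, with send() standing in for the actual cluster transport:

    import random

    def send(node, message):
        """Placeholder for the cluster transport (illustration only)."""
        pass

    class PartitionGroup:
        """One logical partition stored on several nodes."""
        def __init__(self, replicas):
            self.replicas = list(replicas)      # e.g. ["node2", "node5"]

        def write(self, op):
            # Updates go to every node in the group so all copies stay identical.
            for node in self.replicas:
                send(node, op)

        def read(self, query):
            # Reads go to one randomly picked node, splitting the read load.
            return send(random.choice(self.replicas), query)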
If one of the nodes in the group fails, operation can resume with the surviving nodes. The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure, and again when the failed node rejoins the cluster, but these are aborts just as in the case of deadlock and lose no committed data.
Shared Nothing
The Virtuoso architecture does not require a SAN for disk sharing across nodes. This is reasonable: a few disks on a local controller can easily provide 300MB/s of read throughput, and passing this over an interconnect fabric that also has to carry inter-node messages could saturate even a fast network.
Client View
A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics. DDL operations like table creation and package installation are limited to one node, though.
Applications such as ODS will run unmodified. They are installed on all nodes with a single install command. After this, the data partitioning must be declared, a one-time operation done cluster by cluster. The only application change is specifying the partitioning columns for each index. The gain is optional redundant storage and capacity that is not limited to a single machine. The penalty is that single operations may take a little longer when not all the data is managed by the same process, but in return parallel throughput is increased. We note that the main ODS performance factor is web page logic and not database access. Thus splitting the web server logic over multiple nodes gives basically linear scaling.
Parallel Query Execution
Message latency is the principal performance factor in a clustered database. Because of this, Virtuoso packs the maximum number of operations into a single message. For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the inner-loop retrievals are run in parallel. So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency. If each step of the join gets 10 rows, for a total of 100,000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.
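The batching can be pictured roughly as follows; this is illustrative Python, not Virtuoso's internal machinery, and send() is a stand-in for the message transport:

    from collections import defaultdict
    from itertools import islice

    def send(node, message):
        """Placeholder for the cluster transport (illustration only)."""
        return []

    def batched_loop_join(outer_rows, join_col, node_for, batch_size=10000):
        """Fetch inner-table rows for a whole batch of outer rows at once."""
        outer_rows = iter(outer_rows)
        while True:
            batch = list(islice(outer_rows, batch_size))
            if not batch:
                break
            by_node = defaultdict(list)
            for row in batch:
                by_node[node_for(row[join_col])].append(row[join_col])
            # One message per node per batch instead of one per outer row,
            # so the per-message latency is paid rarely.
            for node, keys in by_node.items():
                yield from send(node, ("lookup", keys))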
Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up, even if a single transaction by itself would be faster on a single node.
Parallel SQL
There are SQL extensions for stored procedures that allow parallelizing operations. For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned. Transactional semantics are kept, but error detection is deferred to the actual execution.
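A rough sketch of the buffering, in Python rather than the actual SQL extension syntax:

    from collections import defaultdict

    def send(node, message):
        """Placeholder for the cluster transport (illustration only)."""
        pass

    class InsertBuffer:
        """Collect inserted rows per target node and ship them in batches."""
        def __init__(self, node_for, batch_size=1000):
            self.node_for = node_for
            self.batch_size = batch_size
            self.pending = defaultdict(list)
            self.count = 0

        def insert(self, row):
            self.pending[self.node_for(row)].append(row)
            self.count += 1
            if self.count >= self.batch_size:
                self.flush()

        def flush(self):
            # Errors such as duplicate keys surface only here, when the
            # batch is actually executed, not when insert() was called.
            for node, rows in self.pending.items():
                send(node, ("insert", rows))
            self.pending.clear()
            self.count = 0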
Transactions
Each transaction is owned by one node of the cluster, the node to which the client is connected. When more than one node besides the owner of the transaction is updated, two-phase commit is used. This is transparent to the application code. No external transaction monitor is required; the Virtuoso instances perform these functions internally. There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction wait information.
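The commit itself follows the textbook two-phase pattern, driven by the owner node; a bare-bones sketch in Python (not Virtuoso internals):

    def two_phase_commit(participants, txn_id, send):
        """Textbook 2PC run by the transaction's owner node (sketch only)."""
        # Phase 1: every updated node must promise it can commit.
        votes = [send(node, ("prepare", txn_id)) for node in participants]
        if all(vote == "ok" for vote in votes):
            # Phase 2: all promised, so make the commit durable everywhere.
            for node in participants:
                send(node, ("commit", txn_id))
            return "committed"
        # Any refusal or missing vote aborts the transaction everywhere.
        for node in participants:
            send(node, ("rollback", txn_id))
        return "aborted"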
Since read transactions can operate without locks, seeing the last committed state of rows with uncommitted updates, waiting for locks is not very common.
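In spirit this amounts to keeping the last committed image of a row readable next to the pending update, as in this toy model (not the actual implementation):

    class VersionedRow:
        """Toy model of read committed without read locks."""
        def __init__(self, value):
            self.committed = value      # last committed image, always readable
            self.uncommitted = None     # pending update held by a writer

        def read(self):
            # Readers never wait; they see the last committed state even
            # while another transaction holds the row for update.
            return self.committed

        def update(self, value):
            self.uncommitted = value

        def commit(self):
            self.committed, self.uncommitted = self.uncommitted, None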
Interconnect and Threading
Virtuoso uses TCP to connect between instances. A single instance can have multiple listeners on different network interfaces for cluster activity. The interfaces are used in round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface. Long messages, such as transfers of blobs, are done on a separate thread, allowing normal service on the cluster node while the transfer is proceeding.
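A peer picking among another node's listeners and pushing long transfers onto their own thread could look roughly like this (illustration only; transmit() stands in for the actual TCP send, and the addresses are made up):

    import threading
    from itertools import cycle

    def transmit(interface, message):
        """Placeholder for the actual TCP send (illustration only)."""
        pass

    class Peer:
        def __init__(self, interfaces):
            # e.g. ["10.0.0.1:22201", "10.0.1.1:22201"]
            self.next_interface = cycle(interfaces)

        def send(self, message):
            iface = next(self.next_interface)   # round robin over interfaces
            if len(message) > 100000:
                # Long messages such as blobs get their own thread so normal
                # traffic to the node is not blocked during the transfer.
                threading.Thread(target=transmit, args=(iface, message)).start()
            else:
                transmit(iface, message)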
We will have to test the performance of TCP over InfiniBand to see if there is a clear gain in going to a lower-level interface like MPI. The Virtuoso architecture is based on streams connecting cluster nodes point to point. The design does not per se gain from remote DMA or other features provided by MPI. Typically, messages are quite short, under 100K. Flow control for blob transfers is nice to have, however, but can be written at the application level if needed. We will get real data on the performance of different interconnects in the coming weeks.
Deployment and Management
Configuring is quite simple: each process has a copy of the same cluster configuration file, in which one line differs from host to host, telling the process which node it is. The database configuration files, on the other hand, are individual per host, accommodating different file system layouts and the like. Setting up a node requires copying the executable and two configuration files, no more. All functionality is contained in a single process. There are no installers to run.
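For illustration, the shared part could be no more than a host list plus the one per-host line; the file format below is made up for the example, not Virtuoso's actual configuration syntax:

    # Hypothetical shared cluster file, identical on every host except "this":
    #   host1  10.0.0.1:22201
    #   host2  10.0.0.2:22201
    #   this   host2
    def read_cluster_config(path):
        hosts, this_host = {}, None
        for line in open(path):
            name, value = line.split(None, 1)
            if name == "this":
                this_host = value.strip()   # which node this process is
            else:
                hosts[name] = value.strip()
        return hosts, this_host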
Changing the number of cluster nodes or their network interfaces requires a cluster restart. Changing data partitioning requires copying the data into a new table and renaming this over the old one; this is time consuming and does not mix well with updates. Splitting an existing cluster node does not by itself require copying the data for repartitioning, but shifting data between partitions does.
A consolidated status report shows the general state and the level of intra-cluster traffic as counts of messages and bytes.
Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.
Present State and Next Developments
The basics are now in place. Some code remains to be written for things such as distributed deadlock detection, the two-phase commit recovery cycle, management functions, and so forth. Some SQL operations, like the text index, statistics sampling, and index intersection, need special support that is yet to be written.
The RDF capabilities are not specifically affected by clustering except in a couple of places. Loading will be slightly revised to use larger batches of rows to minimize latency, for example.
There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins, and so on. These will be added gradually. They are not really central to the first application, RDF storage, but are quite important for business intelligence, for example.
We will run some benchmarks comparing single-host and clustered Virtuoso instances over the next weeks. Some of this will be with real data, giving an estimate of when we can move some of the RDF data we presently host to the new platform. We will benchmark against Oracle and DB2 later, but first we will get things to work and compare against ourselves.
Roughly, we expect a halving of space consumption, a significant increase in single-query performance, and parallel throughput that scales linearly with the addition of cluster nodes.
The next update will be on this blog within two weeks.