Introduction

This post discusses the technical specifics of how we accomplish smooth transactional operation in a database server cluster under different failure conditions. (A higher-level short version was posted last week.) The reader is expected to be familiar with the basics of distributed transactions.

Someone on a cloud computing discussion list called two-phase commit (2PC) the "anti-availability protocol." There is indeed a certain anti-SQL and anti-2PC sentiment out there, with key-value stores and "eventual consistency" being talked about a lot. Indeed, if we are talking about wide-area replication over high-latency connections, then 2PC with synchronously-sharp transaction boundaries over all copies is not really workable.

For multi-site operations, a level of eventual consistency is indeed quite unavoidable. Exactly what the requirements are depends on the application, so I will focus here on operations inside one site.

The key-value store culture seems to focus on workloads where a record is relatively self-contained. The record can be quite long, with repeating fields, different selections of fields in consecutive records, and so forth. Such a record would typically be split over many tables of a relational schema. In the RDF world, such a record would be split even wider, with the information needed to reconstitute the full record almost invariably spread over many servers. This comes from the mapping between the text of URIs and their internal IDs being partitioned one way, while each of the many indices on the RDF quads is partitioned in yet another way.

So it comes to pass that in the data models we are most interested in, the application-level entity (e.g., a user account in a social network) is not a contiguous unit with a single global identifier. The social network user account, which the key-value store would treat as a unit of replication, mastering, and eventual consistency, will in RDF or SQL be a set of maybe hundreds of tuples, each indexed in more than one way, nearly invariably spanning multiple nodes of the database cluster.

So, before we can talk about wide-area replication and eventual consistency with application-level semantics, we need a database that can run on a fair-sized cluster and have cast-iron consistency within its bounds. If such a cluster is to be large and is to operate continuously, it must have some form of redundancy to cover for hardware failures, software upgrades, reboots, etc., without interruption of service.

This is the point of the design space we are tackling here.

Non Fault-Tolerant Operation

There are two basic modes of operation we cover: bulk load, and online transactions.

In the case of bulk load, we start with a consistent image of the database; load data; and finish by making another consistent image. If there is a failure during load, we lose the whole load, and restart from the initial consistent image. This is quite simple and is not properly transactional. It is quicker for filling a warehouse but is not to be used for anything else. In the remainder, we will only talk about online transactions.
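
To make the mode concrete, here is a minimal sketch of the bulk-load pattern in Python, assuming a purely hypothetical db object with make_consistent_image(), restore(), and load() calls (this is not Virtuoso's actual interface): the load runs between two consistent images, and any failure discards the whole load.

```python
# Minimal sketch of bulk-load mode, assuming a hypothetical db object with
# make_consistent_image(), restore(), and load() operations; not Virtuoso's API.
def bulk_load(db, batches, max_attempts=3):
    image = db.make_consistent_image()          # starting point to fall back to
    for _ in range(max_attempts):
        try:
            for batch in batches:
                db.load(batch)                  # not transactional; fast warehouse fill
            return db.make_consistent_image()   # finish with a new consistent image
        except Exception:
            db.restore(image)                   # lose the whole load and start over
    raise RuntimeError("bulk load failed repeatedly")
```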

When all cluster nodes are online, operation is relatively simple. Each entry of each index belongs to a partition that is determined by the values of one or more partitioning columns of said index. There are no tables separate from indices; the relational row is on the index leaf of its primary key. Secondary indices reference the row by including the primary key. Blobs are in the same partition as the row which contains the blob. Each partition is then stored on a "cluster node." In non fault-tolerant operation, each such cluster node is a single process with exclusive access to its own permanent storage, consisting of database files and logs; i.e., each node is a single server instance. It does not matter whether the storage is local or on a SAN; the cluster node is still the only one accessing it.
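
As an illustration of the partitioning, the following Python sketch maps an index entry to a cluster node from the values of its partitioning columns; the hash function, node list, and index names are made-up assumptions, not Virtuoso's actual scheme.

```python
# Minimal sketch: map an index entry to the cluster node that owns it, based on
# the partitioning-column values. Hash and node names are illustrative only.
from zlib import crc32

CLUSTER_NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster nodes

def partition_for(index_name: str, partition_key_values: tuple) -> str:
    """Pick the cluster node that owns this index entry."""
    h = crc32(repr((index_name, partition_key_values)).encode())
    return CLUSTER_NODES[h % len(CLUSTER_NODES)]

# The row lives on the leaf of its primary-key index; a secondary-index entry
# carries the primary key so the row can be found from it.
print(partition_for("customer_pk", (10042,)))            # row, plus its blobs
print(partition_for("customer_by_email", ("a@b.com",)))  # secondary-index entry
```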

When things are not fault tolerant, transactions work as follows:

When there are updates, two-phase commit is used to guarantee a consistent result. Each transaction is coordinated by one cluster node, which issues the updates in parallel to all cluster nodes concerned. Sending two update messages instead of one does not significantly impact latency. The coordinator of each transaction is the primary authority for the transaction's outcome. If the coordinator of the transaction dies between the phases of the commit, the transaction branches stay in the prepared state until the coordinator is recovered and can be asked again about the outcome of the transaction. Likewise, if a non-coordinating cluster node with a transaction branch dies between the phases, it will do a roll-forward and ask the coordinator for the outcome of the transaction.
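
The shape of this flow can be sketched as follows in Python; the class and method names are illustrative only. Prepares go out in parallel, the coordinator records the outcome before phase two, and a branch left in prepared state after a lost message asks the coordinator for that outcome.

```python
# Minimal sketch of the two-phase commit flow described above; illustrative
# names, not Virtuoso's internal API.
from concurrent.futures import ThreadPoolExecutor

class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "active"              # active -> prepared -> committed/aborted

    def prepare(self, txn_id):
        self.state = "prepared"            # branch is now durable and undecided
        return True                        # vote to commit; a real branch may vote no

    def finish(self, txn_id, commit):
        self.state = "committed" if commit else "aborted"

class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.outcomes = {}                 # the coordinator is the authority on outcomes

    def run(self, txn_id):
        # Phase one: prepares go to all participants in parallel.
        with ThreadPoolExecutor() as pool:
            votes = list(pool.map(lambda p: p.prepare(txn_id), self.participants))
        commit = all(votes)
        self.outcomes[txn_id] = commit     # decision recorded before phase two
        # Phase two: tell everyone the decision.
        for p in self.participants:
            p.finish(txn_id, commit)
        return commit

    def outcome(self, txn_id):
        # A branch stuck in "prepared" after a lost message asks this question.
        return self.outcomes.get(txn_id)

branches = [Participant("n1"), Participant("n2")]
coordinator = Coordinator(branches)
print(coordinator.run("txn-1"), coordinator.outcome("txn-1"))   # True True
```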

If cluster nodes occasionally crash and then recover relatively quickly, without ever losing transaction logs or database files, this is resilient enough. Everything is symmetrical; there are no cluster nodes with special functions, except for one master node that has the added task of resolving distributed deadlocks.

I suppose our anti-SQL person called 2PC "anti-availability" because in the above situation we have the following problems: if any one cluster node is offline, it is quite likely that no transaction can be committed. This is so unless the data is partitioned on a key with application semantics, and all data touched by a transaction usually stays within a single partition. Then operations could proceed on most of the data while one cluster node was recovering. But, especially with RDF, this is never the case, since keys are partitioned in ways that have nothing to do with application semantics. Further, if one uses XA or Microsoft DTC with the monitor on a single box, this box can become a bottleneck and/or a single point of failure. (Among other considerations, this is why Virtuoso does not rely on any such monitor.) Further, if a cluster node dies never to be heard of again, leaving prepared but uncommitted transaction branches, the rest of the system has no way of telling what to do with them, again unless relying on a monitor that is itself liable to fail.

If transactions have a real world counterpart, it is possible, at least in theory, to check the outcome against the real world state: One can ask a customer if an order was actually placed or a shipment delivered. But when a transaction has to do with internal identifiers of things, for example whether mailto://plaidskirt@hotdate.com has internal ID 0xacebabe, such a check against external reality is not possible.

Fault-Tolerant Operation

In a fault-tolerant setting, we introduce the following extra elements: Cluster nodes consist of "quorums" of mutually-mirroring server instances. Each such quorum holds a partition of the data. Such a quorum typically consists of two server instances, but may have three for extra safety. If all server instances in the quorum are offline, then the cluster node is offline, and the cluster is not fully operational. If at least one server instance in a quorum is online, then the cluster node is online, and the cluster is operational and can process new transactions.
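
The availability rule can be stated in a few lines of Python; the layout of quorums and the instance names below are hypothetical.

```python
# Minimal sketch of the availability rule: a cluster node (quorum) is online if
# at least one member server instance is up; the cluster is operational only if
# every quorum is online. The data layout is illustrative.
quorums = {
    "node-A": {"A1": True,  "A2": True},                 # instance -> reachable?
    "node-B": {"B1": False, "B2": True},                 # one mirror down, node still online
    "node-C": {"C1": True,  "C2": True, "C3": True},     # triplicated for extra safety
}

def node_online(members: dict) -> bool:
    return any(members.values())

def cluster_operational(quorums: dict) -> bool:
    return all(node_online(members) for members in quorums.values())

print(cluster_operational(quorums))   # True: every partition still has a live copy
```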

We designate one cluster node (i.e., one quorum of 2 or 3 server instances) to act as a master node, and we set an order of precedence among its member instances. In addition to arbitrating distributed deadlocks, the master instance on duty will handle reports of server instance failures, and answer questions about any transactions left hanging in prepared state by a dead transaction coordinator. If the master on duty fails, the next master in line will either notice this itself in the course of normal business or get a complaint from another server instance about not being able to contact the previous master.
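
A minimal sketch of the precedence rule, with made-up instance names: the master on duty is simply the first member of the master quorum that is currently reachable.

```python
# Minimal sketch of master precedence within the master quorum; names are illustrative.
MASTER_PRECEDENCE = ["M1", "M2", "M3"]     # configured order of precedence

def master_on_duty(reachable: set):
    for m in MASTER_PRECEDENCE:
        if m in reachable:
            return m                       # first live master in line takes the duty
    return None                            # whole master quorum offline

print(master_on_duty({"M1", "M2", "M3"}))  # M1
print(master_on_duty({"M2", "M3"}))        # M2 steps in when M1 is unreachable
```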

There is no global heartbeat messaging per se, but since connections between server instances are reused long-term, a dropped connection will be noticed and the master on duty will be notified. If all masters are unavailable, that entire quorum (i.e., the master node) is offline and thus (as with any entire node going offline) most operations will fail anyway, unless by chance they do not hit any data managed by that failed quorum.

When it receives a notice of unavailability, the master instance on duty tries to contact the unavailable server instance itself; if this attempt also fails, it notifies all remaining instances that the failed server instance is removed from the cluster. The effect is that the remaining server instances stop attempting to access the failed instance. Updates to the partitions managed by the failed server instance are no longer sent to it, so updates to this data succeed, being made against the other server instances in that quorum. Updates to the data of the failed server instance will fail only in the window of time between the actual failure and the removal, which is typically well under a second. The removal of a failed server instance is delegated to a central authority so that the remaining instances do not get in each other's way when trying to effect the removal.
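
The removal procedure can be sketched as follows; the ping and broadcast mechanics and the names are illustrative, not Virtuoso's internal API. The point is that the master on duty double-checks before removing, and the removal is broadcast once, from one place.

```python
# Minimal sketch: on a report of unavailability, the master on duty re-checks the
# instance itself and, only if that also fails, tells everyone to stop routing to it.
class MasterOnDuty:
    def __init__(self, all_instances, ping, broadcast):
        self.live = set(all_instances)   # instances currently part of the cluster
        self.ping = ping                 # callable: instance -> bool (reachable?)
        self.broadcast = broadcast       # callable: message -> None

    def report_unreachable(self, instance):
        if instance not in self.live:
            return                       # already removed; avoid duplicate work
        if self.ping(instance):
            return                       # transient glitch; keep the instance
        self.live.discard(instance)
        # Remaining instances stop sending updates to the removed instance;
        # writes to its partitions succeed against its quorum partner(s).
        self.broadcast({"removed": instance})

m = MasterOnDuty({"A1", "A2", "B1", "B2"},
                 ping=lambda i: i != "B1",
                 broadcast=print)
m.report_unreachable("B1")    # -> {'removed': 'B1'}
```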

If the failed server instance left prepared uncommitted transactions behind, the server instances holding such branches will in due course contact the transaction coordinator to ask what should be done. This is the normal procedure for dealing with possibly dropped commit or rollback messages. When they discover that the coordinator has been removed, they contact the master on duty instead. Each prepare message of a transaction lists all the server instances participating in the transaction; thus the master can check whether each has received the prepare. If all have the prepare and none has an abort, the transaction is committed. The dead coordinator may not know this, or may indeed not have the transaction logged at all, since it sends the prepares before logging its own prepare; its recovery will handle this, though. We note that for each branch held by the failed server instance, at least one copy remains on a surviving member of its quorum; otherwise a whole quorum would have failed. In cases where a transaction has branches on a quorum whose members have all failed, the system cannot decide the outcome, and will periodically retry until at least one member of the failed quorum becomes available.
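
A minimal sketch of this decision, treating each participant as a quorum for simplicity; the data structures, and the choice to roll back when a surviving copy never received the prepare, are illustrative assumptions rather than Virtuoso's exact internals.

```python
# Minimal sketch: the master on duty decides an in-doubt transaction left behind
# by a removed coordinator by polling the surviving copies of each branch.
def resolve_in_doubt(participants, branch_state, quorum_of):
    """
    participants : quorums listed in the prepare message
    branch_state : callable, instance -> "prepared" | "aborted" | "unknown" | None
                   (None meaning the instance is unreachable)
    quorum_of    : callable, participant -> list of mirror instances holding the branch
    Returns "commit", "rollback", or "retry-later".
    """
    states = []
    for p in participants:
        copies = [branch_state(m) for m in quorum_of(p)]
        known = [s for s in copies if s is not None]
        if not known:
            return "retry-later"      # whole quorum down: cannot decide yet
        states.append(known[0])
    if any(s == "aborted" for s in states):
        return "rollback"
    if all(s == "prepared" for s in states):
        return "commit"               # everyone holds the prepare, nobody aborted
    return "rollback"                 # someone never received the prepare

print(resolve_in_doubt(
    ["A", "B"],
    branch_state={"A1": "prepared", "A2": "prepared", "B1": None, "B2": "prepared"}.get,
    quorum_of={"A": ["A1", "A2"], "B": ["B1", "B2"]}.get))   # commit
```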

The most complex part of the protocol is the recovery of a failed server instance. The recovery starts with a normal roll forward from the local transaction log. After this, the server instance contacts the master on duty to ask for its state. Typically, the master replies that the recovering server instance had been removed and is out of date. Once this is established, the recovering server instance contacts a live member of its quorum and asks for sync. The failed server instance has an approximate timestamp of its last received transaction; it knows this from the roll forward, where time markers are interspersed now and then between transaction records.

The live partner then sends its transaction log(s) covering the time from a few seconds before the last transaction of the failed partner up to the present. A few transactions may get rolled forward twice, but this does no harm, since these records carry absolute values and no deltas, and the second insert of a key is simply ignored. When the sender of the log reaches its last committed log entry, it asks the recovering server instance to confirm successful replay of the log so far. Having the confirmation, the sender aborts all unprepared transactions affecting it and will not accept any new ones until the sync is completed. If new transactions were committed between sending the last of the log and freezing new transactions, these too are shipped to the recovering server instance in their committed or prepared state. When these are also confirmed replayed, the recovering server instance is in exact sync.

The sender then notifies the rest of the cluster that the sync is complete and that the recovered server instance is to be included in any updates of its slice of the data. The time between freezing and re-enabling transactions is the time needed to replay whatever came in between the first sync and the completion of the freeze. Typically nothing came in, so this time is measured in milliseconds. If an application got its transaction killed in this maneuver, it will see the result as a deadlock.
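
The handshake can be sketched as follows, with message passing reduced to direct method calls; the log format, timestamps, and method names are all illustrative.

```python
# Minimal sketch of the resync handshake between a recovering server instance
# and a live member of its quorum; data structures are illustrative.
class LivePartner:
    def __init__(self, log):
        self.log = log                    # list of (timestamp, record)
        self.accepting = True             # accepting new transactions?

    def log_since(self, ts):
        # Ship everything from a little before the recovering side's last record;
        # some overlap is harmless because replay is idempotent.
        return [e for e in self.log if e[0] >= ts - 2]

    def freeze_and_delta(self, last_shipped_ts):
        # Stop taking new transactions, ship whatever slipped in meanwhile
        # (typically nothing).
        self.accepting = False
        return [e for e in self.log if e[0] > last_shipped_ts]

    def announce_recovered(self):
        self.accepting = True             # resume, now mirroring to the recovered peer


class RecoveringInstance:
    def __init__(self, last_ts):
        self.last_ts = last_ts            # approximate time of its last logged transaction
        self.applied = set()

    def replay(self, entries):
        for ts, rec in entries:
            self.applied.add(rec)         # re-inserting an existing key is a no-op


partner = LivePartner(log=[(10, "t1"), (11, "t2"), (12, "t3")])
recovering = RecoveringInstance(last_ts=11)

batch = partner.log_since(recovering.last_ts)       # 1. ship the log tail
recovering.replay(batch)                            # 2. replay; duplicates ignored
last_ts = batch[-1][0] if batch else recovering.last_ts
delta = partner.freeze_and_delta(last_ts)           # 3. freeze and ship the remainder
recovering.replay(delta)                            # 4. now in exact sync
partner.announce_recovered()                        # 5. cluster resumes updates here
print(sorted(recovering.applied))                   # ['t1', 't2', 't3']
```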

If the recovering server instance received transactions in prepared state, it will ask about their outcome as a part of the periodic sweep through pending transactions. One of these transactions could even be one the recovering instance itself originally coordinated, where the prepares had gone out before it had time to log its own prepare. Thus, this eventuality too is covered and has a consistent outcome. Failures can interrupt the recovery process; the recovering server instance will have logged as far as it got, and will pick up from that point onward. Real-time clocks on the host machines of the cluster have to be in approximate sync, within a margin of a minute or so; this is not a problem in a closely connected network.

For simultaneous failure of an entire quorum of server instances (i.e., a set of mutually-mirroring partners; a cluster node), the rule is that the last one to fail must be the first to come back up. In order to have uninterrupted service across arbitrary double failures, one must store things in triplicate; statistically, however, most double failures will not hit two server instances of the same quorum.

The protocol for recovery of failed server instances of the master quorum (i.e., the master cluster node) is identical, except that a recovering master will have to ask the other master(s) which one is more up to date. If the recovering master has a log entry of having excluded all other masters in its quorum from the cluster, it can come back online without asking anybody. If there is no such entry, it must ask the other master(s). If all had failed at the exact same instant, none has an entry of the other(s) being excluded and all will know that they are in the same state since any update to one would also have been sent to the other(s).
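
A minimal sketch of that rule, with a made-up log-record format: a recovering master may come up unasked only if its own log shows it had already excluded every other master.

```python
# Minimal sketch of the recovering-master rule; the log record format is illustrative.
def may_come_online_unasked(own_log, other_masters):
    excluded = {rec["instance"] for rec in own_log if rec.get("event") == "excluded"}
    return set(other_masters) <= excluded    # own log says all the others were already out

own_log = [{"event": "excluded", "instance": "M2"},
           {"event": "excluded", "instance": "M3"}]
print(may_come_online_unasked(own_log, ["M2", "M3"]))   # True: come up directly
print(may_come_online_unasked([], ["M2", "M3"]))        # False: ask the other master(s) first
```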

Failure of Storage Media

When a server instance fails, its permanent storage may or may not survive. Especially with mirrored disks, storage most often survives a failure. However, the survival of the database does not depend on any single server instance retaining any permanent storage over failure. If storage is left in place, as in the case of an OS reboot or replacing a faulty memory chip, rejoining the cluster is done based on the existing copy of the database on the server instance.

If there is no existing copy, a copy can be taken from any surviving member of the same quorum. This consists of the following steps: First, a log checkpoint is forced on the surviving instance. Normally log checkpoints are done at regular intervals, independently on each server instance. The log checkpoint writes a consistent state of the database to permanent storage. The disk pages forming this consistent image will not be written to until the next log checkpoint. Therefore copying the database file is safe and consistent as long as a log checkpoint does not take place between the start and end of the copy; thus checkpoints are disabled right after the initial checkpoint. The copy can take a relatively long time; around 20 seconds per gigabyte over a 1GbE network is a good figure. At the end of the copy, checkpoints are re-enabled on the surviving server instance. The recovering database starts without a log, sees the timestamp of the checkpoint in the database, and asks for transactions from just before this time up to the present. The recovery then proceeds as outlined above.
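
The copy procedure reduces to the following ordering, sketched here with placeholder checkpoint-control calls standing in for whatever mechanism the administrator actually uses: checkpoint, disable checkpoints, copy, re-enable checkpoints.

```python
# Minimal sketch of cloning the database files from a surviving quorum member;
# the checkpoint-control calls are placeholders, not Virtuoso's actual API.
import os
import shutil
import tempfile

def clone_database(survivor, src_path, dst_path):
    survivor.checkpoint()                  # write a consistent image to disk
    survivor.disable_checkpoints()         # the pages of that image now stay untouched
    try:
        shutil.copyfile(src_path, dst_path)    # expect roughly 20 s/GB over 1GbE
    finally:
        survivor.enable_checkpoints()      # normal checkpointing resumes

class FakeInstance:                        # stand-in so the sketch runs as-is
    def checkpoint(self):           print("checkpoint forced")
    def disable_checkpoints(self):  print("checkpoints off")
    def enable_checkpoints(self):   print("checkpoints back on")

src = tempfile.NamedTemporaryFile(delete=False, suffix=".db")
src.write(b"database pages"); src.close()
clone_database(FakeInstance(), src.name, src.name + ".copy")
os.remove(src.name); os.remove(src.name + ".copy")
```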

Network Failures

The CAP theorem states that Consistency, Availability, and Partition-tolerance do not mix. "Partition" here means a split of the network.

It is trivially true that if the network splits so that on both sides there is a copy of each partition of the data, both sides will think themselves the live copy left online after the other died, and each will thus continue to accumulate updates. Such an event is not very probable within one site where all machines are redundantly connected to two independent switches. Most servers have dual 1GbE on the motherboard, and both ports should be used for cluster interconnect for best performance, with each attached to an independent switch. Both switches would have to fail in such a way as to split their respective network for a single-site network split to happen. Of course, the likelihood of a network split in multi-site situations is higher.

One way of guarding against network splits is to require that at least one partition of the data have all copies online. Additionally, the master on duty can request each cluster node or server instance it expects to be online to connect to every other node or instance, and to report which they could reach. If the reports differ, there is a network problem. This procedure can be performed using both interfaces or only the first or second interface of each server to determine if one of the switches selectively blocks some paths. These simple sanity checks protect against arbitrary network errors. Using TCP for inter-cluster-node communication in principle protects against random message loss, but the Virtuoso cluster protocols do not rely on this. Instead, there are protocols for retry of any transaction messages and for using keep-alive messages on any long-running functions sent across the cluster. Failure to get a keep-alive message within a certain period will abort a query even if the network connections look OK.
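
The reachability cross-check amounts to comparing the views reported by the different instances; a sketch, with made-up report data:

```python
# Minimal sketch of the reachability cross-check: the master asks every instance
# which peers it can reach; if any two resulting views of the cluster disagree,
# suspect a network problem. The report data is illustrative.
def network_suspect(reports: dict) -> bool:
    """reports: instance -> set of peers it could reach."""
    views = {frozenset(peers) | {node} for node, peers in reports.items()}
    return len(views) > 1                  # differing views => possible split or switch fault

healthy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
broken  = {"A": {"B"},      "B": {"A", "C"}, "C": {"A", "B"}}   # A cannot see C
print(network_suspect(healthy), network_suspect(broken))        # False True
```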

Backups, and Recovery from Loss of Entire Site

For a constantly-operating distributed system, it is hard to define what exactly constitutes a consistent snapshot. The checkpointed state on each cluster node is consistent as far as this cluster node is concerned (i.e., it contains no uncommitted data), but the checkpointed states on all the cluster nodes are not from exactly the same moment in time. The complete state of a cluster is the checkpoint state of each cluster node plus the current transaction log of each. If the logs were shipped in real time to off-site storage, a consistent image could be reconstructed from them. Since such shipping cannot be synchronous due to latency considerations, some transactions could be received only in part in the event of a failure of the off-site link. Such partial transactions can however be detected at reconstruction time because each record contains the list of all participants of the transaction. If some piece is found missing, the whole can be discarded. In this way integrity is guaranteed but it is possible that a few milliseconds worth of transactions get lost. In these cases, the online client will almost certainly fail to get the final success message and will recheck the status after recovery.
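
A minimal sketch of the reconstruction rule: since each shipped record names every participant of its transaction, a transaction is kept only if a piece arrived from each participant. The record format here is an illustrative assumption.

```python
# Minimal sketch: filter off-site log records down to complete transactions;
# partial transactions (a few milliseconds' worth, at most) are discarded.
def usable_transactions(received_records):
    """received_records: list of dicts like
       {"txn": "t1", "from": "A", "participants": {"A", "B"}}"""
    seen = {}                                        # txn -> set of senders seen
    participants = {}                                # txn -> full participant list
    for rec in received_records:
        seen.setdefault(rec["txn"], set()).add(rec["from"])
        participants[rec["txn"]] = set(rec["participants"])
    return {t for t in seen if seen[t] == participants[t]}

records = [
    {"txn": "t1", "from": "A", "participants": {"A", "B"}},
    {"txn": "t1", "from": "B", "participants": {"A", "B"}},
    {"txn": "t2", "from": "A", "participants": {"A", "B"}},   # B's piece never arrived
]
print(usable_transactions(records))   # {'t1'}; t2 is discarded at reconstruction
```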

For business continuity purposes, a live feed of transactions can be constantly streamed off-site, for example to a cloud infrastructure provider. One low-cost virtual machine on the cloud will typically be enough for receiving the feed. In the event of long-term loss of the whole site, replacement servers can be procured on the cloud; thus, capital is not tied up in an aging inventory of spare servers. The cloud-based substitute can be maintained for the time it takes to rebuild an owned infrastructure, which is still at present more economical than a cloud-only solution.

Switching a cluster from an owned site to the cloud could be accomplished in a few hours. The prerequisite of this is that there are reasonably recent snapshots of the database files, so that replay of logs does not take too long. The bulk of the time taken by such a switch would be in transferring the database snapshots from S3 or similar to the newly provisioned machines, formatting the newly provisioned virtual disks, etc.

Rehearsing such a maneuver beforehand is quite necessary for predictable execution. We do not presently have a productized set of tools for such a switch, but can advise any interested parties on implementing and testing such a disaster recovery scheme.

Conclusions

We have shown how we can have strong transactional guarantees in a database cluster without single points of failure or performance penalties compared with a non fault-tolerant cluster. Operator intervention is not required for anything short of hardware failure. Recovery procedures are simple, at most consisting of installing software and copying database files from a surviving cluster node. Unless permanent storage is lost in the failure, not even this is required. Real-time off-site log shipment can easily be added to these procedures to protect against site-wide failures.

Future work may be directed toward concurrent operation of geographically-distributed data centers with eventual consistency. Such a setting would allow for migration between sites in the event of whole-site failures, and for reconciliation between inconsistent histories of different halves of a temporarily split network. Such schemes are likely to require application-level logic for reconciliation and cannot consist of an out-of-the-box DBMS alone. All techniques discussed here are application-agnostic and will work equally well for Graph Model (e.g., RDF) and Relational Model (e.g., SQL) workloads.

Glossary

  • Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes, on one or more machines, working in parallel as a single logical database server.
  • Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operation), or a Quorum of Server Instances (Fault-Tolerant Operation), which is a member of a Virtuoso Cluster.
  • Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
  • Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
  • Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).
