We have for some time had the option of storing data in a cluster in multiple copies, in the Commercial Edition of Virtuoso. (This feature is not in the Open Source Edition, and there are no plans to add it there.)

Based on feedback from the field, we decided to make this feature more user-friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developers nor operating personnel need any knowledge of how things actually work.

So I will here make a few high-level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.

Three types of individuals need to know about fault tolerance:

  • Executives: What does it cost? Will it really eliminate downtime?
  • System Administrators: Is it hard to configure? What do I do when I get an alert?
  • Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?

I will explain the matter to each of these three groups:

Executives

The value gained is elimination of downtime. The cost is purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less than it sounds: you get your full money's worth of read throughput and half your money's worth of write throughput. Since most applications are read-heavy, this is a good deal. You do not end up paying for unused capacity.

Server instances are grouped in "quorums" of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.

The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please contact us for specifics.

Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.

System Administrators

To configure, divide your server instances into quorums of two or three whose members will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These quorums are declared in a configuration file. Table definitions do not have to be altered for fault tolerance; it is enough for tables and indices to specify partitioning, as in the sketch below. To cover switch failures, use two switches and two NICs per machine, and connect one of each server's network cables to each switch.
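
As an illustration, here is a minimal sketch of the schema side of this, in Java over JDBC. The connection URL and the table and column names are hypothetical, and the PARTITION clause follows Virtuoso's documented cluster DDL; consult the product documentation for the authoritative syntax. Note that nothing in the DDL mentions quorums or mirroring; placement comes entirely from the configuration file.

    import java.sql.*;

    public class DeclarePartitioning {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection details; any instance in the cluster will do.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:virtuoso://host-a:1111", "dba", "dba");
                 Statement st = con.createStatement()) {
                st.execute("CREATE TABLE ORDERS (O_ID INTEGER, O_DATA VARCHAR,"
                         + " PRIMARY KEY (O_ID))");
                // Partition the table's index by hash of O_ID. This is the only
                // schema-level declaration needed; mirroring across the quorum
                // is decided by the cluster configuration, not by the DDL.
                st.execute("ALTER INDEX ORDERS ON ORDERS PARTITION (O_ID INT)");
            }
        }
    }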

When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention. If a host breaks, remove it and put a spare in its place: if its disks are OK, move them into the replacement host and start; if the disks are gone, copy the database files from the live copy first. Then start the replacement database, and the system will do the rest. The system stays online in read-write mode throughout, including during the copy.

Mirrored disks in individual hosts are optional, since the data is in two copies anyway. They do, however, shorten the window during which a partition runs on a single server instance, since they largely eliminate the need to copy hundreds of gigabytes of database files when recovering a failed instance.

Application Developers/Programmers

An application can connect to any server instance in the cluster and have access to the same data, with full ACID properties.

There are two types of errors that can occur in any database application: the database server instance may be offline or otherwise unreachable, or a transaction may be aborted due to a deadlock.

For the missing server instance, the application should try to reconnect. An ODBC/JDBC connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.
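
For illustration, here is a minimal reconnect sketch in Java over JDBC, assuming two mirrored instances at the hypothetical addresses below. Listing the alternates in the connect string, where the driver supports it, achieves the same thing without any application code.

    import java.sql.*;

    public class Reconnect {
        // Hypothetical addresses of the two instances of one quorum.
        private static final String[] URLS = {
            "jdbc:virtuoso://host-a:1111",
            "jdbc:virtuoso://host-b:1111"
        };

        static Connection connect(String user, String password) throws SQLException {
            SQLException last = null;
            for (String url : URLS) {
                try {
                    return DriverManager.getConnection(url, user, password);
                } catch (SQLException e) {
                    last = e;  // this instance is unreachable; try its mirror
                }
            }
            throw last;  // every member of the quorum was unreachable
        }
    }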

For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (SQL State 40001) as best practices dictate, there is no change needed.
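
Here is a minimal retry sketch, again in Java over JDBC. The only Virtuoso-specific fact it relies on is the one stated above: deadlocks and cluster reconfiguration both surface as SQL State 40001. The names and the retry limit are illustrative.

    import java.sql.*;

    public class RetryingUpdate {
        static void executeWithRetry(Connection con, String sql) throws SQLException {
            con.setAutoCommit(false);
            for (int attempt = 1; ; attempt++) {
                try (Statement st = con.createStatement()) {
                    st.executeUpdate(sql);
                    con.commit();
                    return;
                } catch (SQLException e) {
                    con.rollback();
                    // 40001 covers both deadlocks and transactions aborted by a
                    // server instance dropping out of or rejoining the cluster.
                    if (!"40001".equals(e.getSQLState()) || attempt >= 5)
                        throw e;
                }
            }
        }
    }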

Conclusion

In summary...

  • Limited extra cost for fault tolerance; no equipment sitting idle.
  • Easy operation: Replace servers when they fail; the cluster does the rest.
  • No changes needed to most applications.
  • No proprietary SQL APIs or special fault tolerance logic needed in applications.
  • Fully transactional programming model.

All the above applies to both the Graph Model (RDF) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso, to be publicly available in the next 2-3 weeks. Please contact OpenLink Software Sales for availability details or for advance evaluation copies.

Glossary

  • Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes, on one or more machines, working in parallel as one system.
  • Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non-Fault-Tolerant Operation) or a Quorum of Server Instances (Fault-Tolerant Operation) that is one member of a Virtuoso Cluster.
  • Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, that together make up a Virtuoso Cluster.
  • Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
  • Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non-Fault-Tolerant Operation), or be one member of the quorum that comprises a Virtuoso Cluster Node (Fault-Tolerant Operation).
