Virtuoso and cluster capacity allocation
I just read Google's Bigtable paper. It is relevant here because it talks about keeping petabyte-scale (1024 TB) tables on a variable-size cluster of machines.
I talked about partitioning versus distributed cache in the second-to-last post. The problem, in short, is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so costly that doing it online is a problem. And repartitioning is needed whenever machines are added, unless the size increment is a doubling, which it never will be.
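To put a number on the repartitioning cost, here is a small back-of-the-envelope script; it illustrates the general point about hash-mod-N layouts and is not anything Virtuoso-specific:

    # Fraction of keys that must move when hash-mod-N partitioning grows
    # from n_before to n_after machines. Illustrative sketch only.
    def moved_fraction(n_before, n_after, n_keys=1_000_000):
        moved = sum(1 for k in range(n_keys) if k % n_before != k % n_after)
        return moved / n_keys

    print(moved_fraction(8, 9))    # about 0.89: nearly everything moves
    print(moved_fraction(8, 16))   # 0.5: only a doubling keeps half in place

Going from 8 to 9 machines relocates close to 90% of the keys; only a doubling leaves half of them where they were.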
So Oracle has rather elegantly stepped around the whole problem by not partitioning for clustering in the first place; thus incremental capacity change does not require repartitioning. Oracle does have partitioning for other purposes, but it is not tied to their cluster proposition.
I did not go the cache fusion route because I could not figure out a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the data and not the other way around. Besides, not being totally dependent on a microsecond-latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages, which go a page at a time, unless of course you kill yourself with extra trickery for batching these too.
So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course.
Google has done this in Bigtable with a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When a partition, called a tablet, gets too big, it splits, just like a B-tree index node. The tree top must be common knowledge, as must the allocation of partitions to servers, but these can be cached here and there and do not change all the time.
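To make the splitting mechanics concrete, here is a toy sketch of range partitioning in the Bigtable spirit; the class names and the tiny split threshold are mine, not Google's:

    import bisect

    TABLET_LIMIT = 4   # keys per tablet; in reality tens to hundreds of MB

    class Tablet:
        def __init__(self, low, keys=None):
            self.low = low                    # inclusive lower bound of the key range
            self.keys = sorted(keys or [])

    class Table:
        def __init__(self):
            self.tablets = [Tablet(low="")]   # one tablet covers everything at first

        def find(self, key):
            # The "tree top": tablet lower bounds, cacheable by clients.
            lows = [t.low for t in self.tablets]
            return self.tablets[bisect.bisect_right(lows, key) - 1]

        def insert(self, key):
            t = self.find(key)
            bisect.insort(t.keys, key)
            if len(t.keys) > TABLET_LIMIT:    # too big: split at the median key
                mid = len(t.keys) // 2
                new = Tablet(low=t.keys[mid], keys=t.keys[mid:])
                del t.keys[mid:]
                self.tablets.insert(self.tablets.index(t) + 1, new)

    table = Table()
    for k in ["ant", "bee", "cat", "dog", "eel", "fox"]:
        table.insert(k)
    print([(t.low, t.keys) for t in table.tablets])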
So how could we do something of the sort here? I know from experience that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, and time consuming; it will run out of disk, require restoring backups, and bring on all sorts of DBA activity that keeps everything down for a long time, unless of course you have MIS staff of a kind not easily found.
The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file groups, partitioned as defined: you still specify the partitioning columns, but not where each partition goes; the system decides this by itself. When a server's file group gets too big, it splits. One half of each key's partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there, but these can be removed in the background. The new file group is managed by the same server process, and the partitioning information on all servers gets updated to reflect the existence of the new file group and the range of hash values that belongs to it.
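Under my reading of the above, the shared state is just a map from hash buckets to file groups, and a split amounts to updating that map plus a background cleanup. A minimal sketch, with hypothetical names (PartitionMap, fg_a and friends are not Virtuoso identifiers):

    from zlib import crc32

    N_BUCKETS = 1024   # fixed number of hash buckets, many more than file groups

    class PartitionMap:
        def __init__(self, file_groups):
            # bucket -> file group; every server keeps a copy of this small map.
            self.bucket_of = [file_groups[b % len(file_groups)]
                              for b in range(N_BUCKETS)]

        def file_group_for(self, key):
            return self.bucket_of[crc32(str(key).encode()) % N_BUCKETS]

        def split(self, old_fg, new_fg):
            # The new file group starts life as a copy of the old one; the split
            # itself only reassigns every second bucket of old_fg to new_fg.
            mine = [b for b, fg in enumerate(self.bucket_of) if fg == old_fg]
            for b in mine[1::2]:
                self.bucket_of[b] = new_fg
            # Rows now sitting on the wrong side of the split are deleted later
            # by a background sweep.

    pmap = PartitionMap(["fg_a", "fg_b"])
    pmap.split("fg_a", "fg_a2")               # fg_a grew too big
    print(pmap.file_group_for("customer:42"))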
If a file group is kept at some reasonable size, under a few GB, it can be moved between servers, even dynamically.
If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines.
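A sketch of the placement rule for the replicated case, just to illustrate the constraint; the host names and the load measure are made up:

    # Pick hosts for the replicas of a newly split file group so that no two
    # replicas of the same group share a machine, preferring lightly loaded
    # hosts. Hypothetical helper, not Virtuoso code.
    def place_replicas(host_load, exclude, n_replicas=2):
        candidates = sorted((load, h) for h, load in host_load.items()
                            if h not in exclude)
        if len(candidates) < n_replicas:
            raise RuntimeError("not enough distinct hosts for the replicas")
        return [h for _, h in candidates[:n_replicas]]

    hosts = {"host1": 5, "host2": 3, "host3": 4, "host4": 3}
    # exclude: hosts that already hold a replica of this file group
    print(place_replicas(hosts, exclude={"host1"}))   # ['host2', 'host4']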
So what happens to disk locality when file groups split? Nothing much. First, partitioning will be set up so that consecutive key values go to the same hash value, so that key compression is not ruined; thus consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20: values 0-1K go one way, values 1K-2K go the other way, values 2K-3K go the first way, and so on.
Now let us suppose the first partition, the one with the even K's, splits. It could split so that multiples of 4 go one way and the rest of the even K's another. Now we'd have 0-1K in place, 2K-3K in the new partition, 4K-5K in place, and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel, but the disk access would be made sequential by the read-ahead logic; remember that these partitions are controlled by the same server process.
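The same example in code, just to check the arithmetic; this is a plain illustration, not how the partition hash is actually spelled in Virtuoso:

    def block_of(key):
        # Bits 10-20 of the key: the "K block" number, so 1024 consecutive
        # values always share a block and key compression stays intact.
        return (key >> 10) & 0x7FF

    def partition_2way(key):
        return block_of(key) % 2           # even K blocks vs. odd K blocks

    def partition_after_split(key):
        # The even-K partition split again: K blocks that are multiples of 4
        # stay in place, the other even blocks (2, 6, 10, ...) move away.
        b = block_of(key)
        if b % 2 == 1:
            return "odd"                   # the untouched partition
        return "even, in place" if b % 4 == 0 else "even, moved"

    for k in (0, 1023, 1024, 2048, 4096):
        print(k, partition_2way(k), partition_after_split(k))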
For purposes of sending functions, the recipient would be the file group, not the host per se. The allocation of file groups to hosts can change.
Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch, and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index create contention and needless thread switches, but since distinct file groups have their own mutexes, there is less interference.
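What the batching could look like, with made-up structures standing in for the real cluster messages:

    from collections import defaultdict

    def build_batches(operations, fg_to_host):
        # operations: list of (file group, op); fg_to_host: current placement.
        # Operations are addressed to file groups; one batch goes to each host,
        # and the receiver may run each file group's share on its own thread.
        per_host = defaultdict(lambda: defaultdict(list))
        for fg, op in operations:
            per_host[fg_to_host[fg]][fg].append(op)
        return per_host

    ops = [("fg_a", "insert r1"), ("fg_b", "insert r2"), ("fg_a", "delete r3")]
    placement = {"fg_a": "host1", "fg_b": "host1"}   # collocated file groups
    for host, by_fg in build_batches(ops, placement).items():
        print(host, dict(by_fg))   # both file groups travel in the same batch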
For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it, so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive.
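For illustration, the bookkeeping could be as simple as keying transaction branches by file group rather than by host; this is my own sketch of the idea, not Virtuoso's transaction manager:

    # One transaction branch per file group touched. Because the branch is tied
    # to the file group and not to the host serving it, moving a file group does
    # not invalidate the branch. Hypothetical sketch.
    class Transaction:
        def __init__(self, txn_id):
            self.txn_id = txn_id
            self.branches = {}                 # file group -> branch state

        def touch(self, file_group):
            br = self.branches.setdefault(file_group,
                                          {"writes": 0, "prepared": False})
            br["writes"] += 1

        def commit(self):
            # Stand-in for a prepare/commit round over all branches; each file
            # group is also its own unit of logging and recovery.
            for br in self.branches.values():
                br["prepared"] = True
            return all(br["prepared"] for br in self.branches.values())

    txn = Transaction("t1")
    txn.touch("fg_a"); txn.touch("fg_b")
    print(txn.commit(), sorted(txn.branches))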
Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system-wide repartitioning runs.
Implementing this in the current Virtuoso is not a real difficulty. There is already a concept of a file group, although we use only two: one for the data and one for temp space. Using multiple ones is not a big deal.
Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.