In Hoc Signo Vinces (part 16 of n): Introduction to Scale-Out

So far, we have analyzed TPC-H in a single-server, memory-only setting. We will now move to larger data and cluster implementations. In principle, TPC-H parallelizes well, so we should expect near-linear scalability; i.e., twice the gear runs twice as fast, or close enough.

In practice, things are not quite so simple. Larger data, particularly a different data-to-memory ratio, and the fact of having no shared memory, all play a role. There is also a network, so partitioned operations, which also existed in the single-server case, now have to send messages across machines, not across threads. For data loading and refreshes, there is generally no shared file system, so data distribution and parallelism have to be considered.

As an initial pass, we look at 100G and 1000G scales on the same test system as before. This is two machines, each with dual Xeon E5-2630, 192 GB RAM, 2 x 512 GB SSD, and QDR InfiniBand. We will also try other platforms, but if nothing else is said, this is the test system.

As of this writing, there is a working implementation, but it is not guaranteed to be optimal as yet. We will adjust it as we go through the workload. One outcome of the experiment will be a precise determination of the data-volume-to-RAM ratio that still gives good performance.

A priori, we know of the following things that complicate life with clusters:

Distributed memory — The working set must be in memory for a run to have a competitive score. A cluster can have a lot of memory, and the data is such that it partitions very evenly, so this appears at first not a problem. The difficulty comes with query memory: If each machine has 1/16th of the total RAM and a hash table would be 1/64th of the working set, on a single-server it is no problem just building the hash table. On a scale-out system, the hash table would be 1/4 of the working set if replicated on each node, which will not fit, especially if there are many such hash tables at the same time. Two main approaches exist: The hash table can be partitioned, but this will force the probe to go cross-partition, which takes time. The other possibility is to build the hash table many times, each time with a fraction of the data, and to run the probe side many times. Since hash tables often have Bloom filters, it is sometimes possible to replicate the Bloom filter and partition the hash table. One has also heard of hash tables that go to secondary storage, but should this happen, the race is already lost; so, we do not go there.

We must evaluate different combinations of these techniques and have a cost model that accurately predicts the performance of each variant. Adding to realism is always safe but halfway difficult to do.
NUMA — Most servers are NUMA (non-uniform memory architecture), where each CPU socket has its own local memory. For single-server cases, we use all the memory for the process. Some implementations have special logic for memory affinity between threads. With scale-out there is the choice of having a server process per-NUMA-node or per-physical-machine. If per-NUMA-node, we are guaranteed only local memory accesses. This is a tradeoff to be evaluated.
Network and Scheduling — Execution on a cluster is always vectored, for the simple reason that sending single-tuple messages is unfeasible in terms of performance. With an otherwise vectored architecture, the message batching required on a cluster comes naturally. However, the larger the cluster, the more partitions there are, which rapidly gets into shorter messages. Increasing the vector size is possible and messages become longer, but indefinite increase in vector size has drawbacks for cache locality and takes memory. To run well, each thread must stay on core. There are two ways of being taken off core ahead of time: Blocking for a mutex, and blocking for network. Lots of short messages run into scheduling overhead, since the recipient must decide what to do with each, which is not really possible without some sort of critical section. This is more efficient if messages are longer, as the decision time does not depend on message length. Longer messages are however liable to block on write at the sender side. So one pays in either case. This is another tradeoff to be balanced.
Flow control — A query is a pipeline of producers and consumers. Sometimes the consumer is in a different partition. The producer must not get indefinitely ahead of the consumer because this would run out of memory, but it must stay sufficiently ahead so as not to stop the consumer. In practice, there are synchronization barriers to check even progress. These will decrease platform utilization, because two threads never finish at exactly the same time. The price of not having these is having no cap on transient memory consumption.
Un-homogenous performance — Identical machines do not always perform identically. This is seen especially with disk, where wear on SSDs can affect write speed, and where uncontrollable hazards of data placement will get uneven read speeds on rotating media. Purely memory-bound performance is quite close, though. Un-anticipatable and uncontrollable hazards of scheduling cause different times of arrival of network messages, which introduces variation in run time on consecutive runs. Single-servers have some such variation from threading, but the effects are larger with a network.

The logical side of query optimization stays the same. Pushing down predicates is always good, and all the logical tricks with moving conditions between subqueries stay the same.

Schema design stays much the same, but there is the extra question of partitioning keys. In this implementation, there are only indices on identifiers, not on dates, for example. So, for a primary key to foreign key join, if there is an index on the foreign key, the index should be partitioned the same way as the primary key. So, joining from orders to lineitem on orderkey will be co-located. Joining from customer to orders by index will be colocated for the c_custkey = o_custkey part (assuming an index on o_custkey) and cross-partition for getting the customer row on c_custkey, supposing that the query needs some property of the customer other than c_custkey or c_orderkey.

A secondary question is the partition granularity. For good compression, nearby values should be consecutive, so here we leave the low 12 bits out of the partitioning. This has effect on bulk load and refreshes, for example, so that a batch of 10,000 lineitems, ordered on l_orderkey will go to only 2 or 3 distinct destinations, thus getting longer messages and longer insert batches, which is more efficient.

This is a quick overview of the wisdom so far. In subsequent installments, we will take a quantitative look at the tradeoffs and consider actual queries. As a conclusion, we will show a full run on a couple of different platforms, and likely provide Amazon machine images for the interested to see for themselves. Virtuoso Cluster is not open source, but the cloud will provide easy access.

To be continued...