Benchmarks, Redux (part 2): A Benchmarking Story

Caeterum censeo, benchmarks are for vendors...

This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks.

We begin right after the publication of the recent Berlin report. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole BSBM matter and the general benchmarking question in forthcoming posts; for now, let's talk about specifics.

In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:

6 Single is the generally available single-instance-server configuration of Virtuoso. Whether this is open source or not does not make a difference.
6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso.
7 Single is the next-generation single-instance-server Virtuoso, about to be released as a preview.

We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM Explore mix at one scale got better throughput as we added more clients, approximately as one would expect based on CPU usage and number of cores, while at another scale this was not so.

At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%.

When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload.

See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous.

Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Quite right, when building with the mutex meter enabled, counting how many times each mutex is acquired and how many times this results in a wait, we found a mutex which gets acquired 600M times in the run, of which an insane 450M result in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex.

Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better.

Now, the mutex in question happens to serialize the buffer cache for one specific page of data, one level down from the root of the index for RDF PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties web their different ways based on the index root.

One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers.

Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time.

So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously "good" throughput of 40K QMpH at 100 Mt went to 66K QMpH.

Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue.

We see how detailed analysis of benchmarks can lead to almost an order of magnitude improvements in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real knowledge of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too but application optimization tends to stop at a point where the application is usable. Also stored procedures and specially-tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development but then this falls under benchmarking and evaluation.

So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios.