In the context of database benchmarks we cannot ignore I/O, even though BSBM has pretty much done so until now.

There are two approaches:

  1. run twice or otherwise make sure one runs from memory and forget about I/O, or

  2. make rules and metrics for warm-up.

We will see if the second is possible with BSBM.

From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on one set each of HDDs (hard disk drives) and SSDs (solid-state drives). We will see that, in this specific application, SSDs can make a significant difference.

In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.

Storage Arrays

| Type | Quantity | Maker   | Size    | Speed    | Interface speed | Controller                          | Drive Cache | RAID |
|------|----------|---------|---------|----------|-----------------|-------------------------------------|-------------|------|
| SSD  | 4        | Crucial | 128 GB  | N/A      | 6 Gbit SATA     | RocketRaid 640                      | 128 MB      | None |
| HDD  | 4        | Samsung | 1000 GB | 7200 RPM | 3 Gbit SATA     | Intel ICH on Supermicro motherboard | 16 MB       | None |

We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file > /dev/null`.

The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.

Two different read-ahead schemes are used:

  • With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.

  • With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.

In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.
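
To make the vectored scheme concrete, here is a minimal sketch of the idea in C: given a batch of keys sorted in ascending order, a single pass over an interior index node decides which children will be visited and schedules only those for read-ahead. The node layout and the `schedule_background_read()` helper are invented for the example and are not Virtuoso's actual data structures.

```c
/* A minimal sketch of the vectored (7 Single style) read-ahead decision:
 * given a batch of search keys sorted in ascending order, one pass over an
 * interior index node picks out exactly the children that the batch will
 * visit and schedules only those for background read.  The node layout and
 * schedule_background_read() are invented for the example. */

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_CHILDREN 8

typedef struct
{
  int      n_children;
  int64_t  lowest_key[MAX_CHILDREN];   /* lowest key reachable under child i */
  int64_t  child_page[MAX_CHILDREN];   /* on-disk page number of child i     */
} node_t;

static void
schedule_background_read (int64_t page_no)
{
  /* A real engine would queue the page on the device's read-ahead thread;
   * here we just log the decision. */
  printf ("read-ahead of page %lld\n", (long long) page_no);
}

static void
vectored_read_ahead (const node_t *n, const int64_t *keys, size_t n_keys)
{
  size_t k = 0;
  for (int c = 0; c < n->n_children && k < n_keys; c++)
    {
      int64_t upper = (c + 1 < n->n_children)
        ? n->lowest_key[c + 1] : INT64_MAX;
      int hit = 0;
      while (k < n_keys && keys[k] < upper)     /* keys are sorted ascending */
        {
          hit = 1;
          k++;
        }
      if (hit)
        schedule_background_read (n->child_page[c]);
      /* sub-trees with no keys in their range are skipped entirely */
    }
}

int
main (void)
{
  node_t  n = { 4, { 0, 100, 200, 300 }, { 11, 12, 13, 14 } };
  int64_t keys[] = { 5, 7, 150, 160, 310 };     /* one sorted lookup batch */
  vectored_read_ahead (&n, keys, 5);            /* pre-reads pages 11, 12, 14 */
  return 0;
}
```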

There are a few different possibilities for the physical I/O:

  1. Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.

  2. A thread that finds it needs a page reads it itself, in place.

  3. Using Unix asynchronous I/O, aio.h, with the aio_* and lio_listio functions (a sketch follows this list).

  4. Using a single read system call for multiple adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gap pages are also read, wasting bandwidth but saving on latency.
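
Option 3 refers to the standard POSIX interface. As a rough illustration, not Virtuoso code, a batch of 8K page reads can be submitted and awaited with `lio_listio` as below; the stripe file name and the page numbers are placeholders, and error handling is minimal.

```c
/* Rough illustration of option 3: submitting a batch of 8K page reads with
 * POSIX asynchronous I/O and waiting for all of them with lio_listio().
 * "stripe1.db" and the page numbers are placeholders; error handling is
 * minimal.  On Linux, link with -lrt. */

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 8192
#define N_PAGES 4

int
main (void)
{
  int fd = open ("stripe1.db", O_RDONLY);       /* placeholder stripe file */
  if (fd < 0)
    {
      perror ("open");
      return 1;
    }

  long pages[N_PAGES] = { 10, 11, 12, 300 };    /* pages to fetch */
  static char buf[N_PAGES][PAGE_SZ];
  struct aiocb  cbs[N_PAGES];
  struct aiocb *list[N_PAGES];

  for (int i = 0; i < N_PAGES; i++)
    {
      memset (&cbs[i], 0, sizeof (cbs[i]));
      cbs[i].aio_fildes = fd;
      cbs[i].aio_offset = (off_t) pages[i] * PAGE_SZ;
      cbs[i].aio_buf = buf[i];
      cbs[i].aio_nbytes = PAGE_SZ;
      cbs[i].aio_lio_opcode = LIO_READ;
      list[i] = &cbs[i];
    }

  /* Submit all reads at once and block until every one has completed.
   * With LIO_NOWAIT the call returns immediately and completion can be
   * checked later with aio_error()/aio_return(). */
  if (lio_listio (LIO_WAIT, list, N_PAGES, NULL) < 0)
    perror ("lio_listio");

  for (int i = 0; i < N_PAGES; i++)
    printf ("page %ld: %zd bytes read\n", pages[i], aio_return (&cbs[i]));

  close (fd);
  return 0;
}
```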

The latter two options (3 and 4) apply only to bulk I/O that is scheduled on background threads, one per independently addressable device (HDD, SSD, or RAID set). These bulk reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.

There is a further choice for single-page random requests: they can either go to the elevator or be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.
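
A bare-bones sketch of such a per-device elevator, assuming a pthreads setting: worker threads enqueue page numbers, and one background thread per device sweeps the sorted queue from low to high offsets, wrapping around at the end of a pass. The structure and names are illustrative only; a real implementation would also merge adjacent pages and handle writes.

```c
/* Bare-bones per-device elevator sketch: elevator_add() queues page numbers
 * from any thread; elevator_thread() sweeps the sorted queue from low to
 * high offsets, wrapping around at the end of a pass.  Illustrative only.
 * Compile with -pthread. */

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_MAX 1024

typedef struct
{
  long            pages[QUEUE_MAX];   /* pending page numbers, kept sorted */
  int             count;
  pthread_mutex_t mtx;
  pthread_cond_t  work;
} elevator_t;

/* Called by any thread that wants a page read in the background. */
static void
elevator_add (elevator_t *e, long page)
{
  pthread_mutex_lock (&e->mtx);
  int i = e->count;
  while (i > 0 && e->pages[i - 1] > page)   /* insertion keeps the queue sorted */
    {
      e->pages[i] = e->pages[i - 1];
      i--;
    }
  e->pages[i] = page;
  e->count++;
  pthread_cond_signal (&e->work);
  pthread_mutex_unlock (&e->mtx);
}

/* One such thread runs per independently addressable device. */
static void *
elevator_thread (void *arg)
{
  elevator_t *e = arg;
  long pos = 0;                             /* current sweep position */
  for (;;)
    {
      pthread_mutex_lock (&e->mtx);
      while (e->count == 0)
        pthread_cond_wait (&e->work, &e->mtx);
      /* take the first pending page at or above the sweep position,
       * wrapping to the lowest pending page at the end of a pass */
      int i = 0;
      while (i < e->count && e->pages[i] < pos)
        i++;
      if (i == e->count)
        i = 0;
      long page = e->pages[i];
      memmove (&e->pages[i], &e->pages[i + 1],
               (size_t) (e->count - i - 1) * sizeof (long));
      e->count--;
      pthread_mutex_unlock (&e->mtx);

      pos = page + 1;
      /* the pread() of the 8K page at offset page * 8192 would go here */
      printf ("elevator read of page %ld\n", page);
    }
  return NULL;
}

int
main (void)
{
  static elevator_t e = { .mtx = PTHREAD_MUTEX_INITIALIZER,
                          .work = PTHREAD_COND_INITIALIZER };
  elevator_add (&e, 42);
  elevator_add (&e, 7);
  elevator_add (&e, 1000);

  pthread_t t;
  pthread_create (&t, NULL, elevator_thread, &e);
  sleep (1);                                /* sweep drains in order: 7, 42, 1000 */
  return 0;
}
```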

While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.

The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.

This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.

The test driver is set at 16 clients, and the run continues for 2000 query mixes or until the target throughput is reached. Target throughput is deemed reached after the first 20-second stretch with CPU at 1200% or higher.
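
For illustration only, the stopping condition could be checked from outside the server on Linux by sampling `/proc/stat` once per second; percentages here follow top's convention, where each fully busy core contributes 100%. This sketch is not the meter used in the runs.

```c
/* Sketch of the stopping condition: sample machine-wide CPU use from
 * /proc/stat once per second and report the first 20 consecutive seconds
 * at or above 1200% (twelve cores' worth, in top's per-core convention).
 * Linux-specific, and only an outside approximation of the in-server meter. */

#include <stdio.h>
#include <unistd.h>

/* Busy jiffies (everything except idle and iowait) summed over all cores. */
static unsigned long long
busy_jiffies (void)
{
  unsigned long long user = 0, nice = 0, sys = 0, idle = 0, iowait = 0,
                     irq = 0, softirq = 0, steal = 0;
  FILE *f = fopen ("/proc/stat", "r");
  if (!f)
    return 0;
  /* first line: "cpu  user nice system idle iowait irq softirq steal ..." */
  if (fscanf (f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
              &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) != 8)
    {
      fclose (f);
      return 0;
    }
  fclose (f);
  return user + nice + sys + irq + softirq + steal;
}

int
main (void)
{
  double hz = (double) sysconf (_SC_CLK_TCK);   /* jiffies per second */
  unsigned long long prev = busy_jiffies ();
  int seconds_at_target = 0;

  for (;;)
    {
      sleep (1);
      unsigned long long cur = busy_jiffies ();
      double pct = 100.0 * (cur - prev) / hz;   /* % summed over all cores */
      prev = cur;
      seconds_at_target = (pct >= 1200.0) ? seconds_at_target + 1 : 0;
      printf ("cpu %.0f%%, %d s at target\n", pct, seconds_at_target);
      if (seconds_at_target >= 20)
        {
          printf ("target throughput reached\n");
          return 0;
        }
    }
}
```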

The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.

The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides a sort of pseudo-locality within extents even with a random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. To generally improve locality, one could shuffle the pages of an all-dirty subtree before writing it out, so as to make physical order match key order. We will look at some tricks in this vein with the column store.
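
A sketch of this allocation policy, with invented data structures: a page created by a split goes to the same extent if it has room, then to that extent's designated overflow extent, and only then to a fresh extent.

```c
/* Illustrative sketch of the allocation policy just described, with invented
 * data structures: a page created by a split goes to the same 2MB extent if
 * it has room, then to that extent's designated overflow extent, and only
 * then to a fresh extent. */

#include <stdio.h>
#include <stdlib.h>

#define EXT_PAGES 256                 /* 256 * 8K pages = 2MB per extent */

typedef struct extent extent_t;
struct extent
{
  long       start_page;              /* first page number of the extent */
  int        n_used;                  /* pages handed out so far         */
  extent_t  *overflow;                /* designated overflow extent      */
};

static long next_extent_start = 0;    /* trivial stand-in for the free list */

static extent_t *
new_extent (void)
{
  extent_t *e = calloc (1, sizeof (extent_t));
  e->start_page = next_extent_start;
  next_extent_start += EXT_PAGES;
  return e;
}

/* Allocate the page created by splitting a page that lives in extent e. */
static long
alloc_after_split (extent_t *e)
{
  if (e->n_used < EXT_PAGES)                    /* same extent first        */
    return e->start_page + e->n_used++;
  if (!e->overflow)
    e->overflow = new_extent ();                /* designate an overflow    */
  if (e->overflow->n_used < EXT_PAGES)          /* then the overflow extent */
    return e->overflow->start_page + e->overflow->n_used++;
  extent_t *f = new_extent ();                  /* last resort: a new extent */
  f->n_used = 1;
  return f->start_page;
}

int
main (void)
{
  extent_t *e = new_extent ();
  e->n_used = EXT_PAGES;                        /* pretend the extent is full */
  printf ("split page allocated at page %ld\n", alloc_after_split (e));
  return 0;
}
```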

For the sake of simplicity we only run 7 Single with the 1000 Mt scale.

The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.

The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough to reach 1200% CPU. The test was not worth continuing.

The result with HDDs and vectored read-ahead would be worse, since vectored read-ahead leads to smaller read-ahead batches and less contiguous read patterns. The individual read times here are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.

There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer takes about 40 ms at a sequential transfer speed of 50 MB/s. Seeking to the next 2MB extent then takes a few ms, most often under 20, so the HDD should give at least half its nominal throughput.
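
The arithmetic behind that estimate, as a small worked example; the nominal 50 MB/s figure and the range of seek times are the assumptions stated above.

```c
/* The arithmetic behind the estimate above, as a worked example: with
 * full-extent (2MB) reads, the fraction of nominal sequential throughput
 * that survives the inter-extent seeks is transfer / (transfer + seek). */

#include <stdio.h>

int
main (void)
{
  double extent_mb = 2.0;
  double seq_mb_s  = 50.0;                              /* nominal speed  */
  double transfer_ms = extent_mb / seq_mb_s * 1000.0;   /* 40 ms / extent */

  for (double seek_ms = 5.0; seek_ms <= 20.0; seek_ms += 5.0)
    printf ("seek %2.0f ms -> %4.1f MB/s effective (%.0f%% of nominal)\n",
            seek_ms,
            seq_mb_s * transfer_ms / (transfer_ms + seek_ms),
            100.0 * transfer_ms / (transfer_ms + seek_ms));
  return 0;
}
```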

We note that, when reading sequential 8K pages inside a single 2MB (256-page) extent, the seek latency is not 0, as one would expect, but an extreme 5 ms. One would think that the drive would buffer a whole track, and that a track would hold a large number of 2MB sections, but apparently this is not so.

Therefore, if a sequential read pattern is now denser than 1 page out of 10, we read all the pages and just keep the ones we want.

So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.
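
A sketch of this merging rule: sort the wanted page numbers and coalesce any that fall within 10 pages of each other into a single contiguous read, accepting that the gap pages will be read and discarded.

```c
/* Sketch of the merging rule: sort the wanted page numbers and coalesce any
 * that fall within GAP pages of each other into one contiguous read.  The
 * pages in the gaps get read and discarded (or cached), trading bandwidth
 * for fewer drive operations. */

#include <stdio.h>
#include <stdlib.h>

#define GAP 10

static int
cmp_long (const void *a, const void *b)
{
  long x = *(const long *) a, y = *(const long *) b;
  return (x > y) - (x < y);
}

/* Print the contiguous page ranges that single read calls would cover. */
static void
merge_reads (long *pages, int n)
{
  qsort (pages, (size_t) n, sizeof (long), cmp_long);
  int i = 0;
  while (i < n)
    {
      long first = pages[i], last = pages[i];
      int wanted = 1;
      while (i + 1 < n && pages[i + 1] - last <= GAP)
        {
          last = pages[++i];
          wanted++;
        }
      printf ("one read of pages %ld..%ld (%ld pages for %d wanted)\n",
              first, last, last - first + 1, wanted);
      i++;
    }
}

int
main (void)
{
  long pages[] = { 3, 5, 14, 60, 64, 200 };   /* pages we actually want */
  merge_reads (pages, 6);                     /* 3..14, 60..64, 200..200 */
  return 0;
}
```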

So we try, and we find that read-ahead does not account for most pages, since it does not get triggered often enough. Thus, we change the triggering condition to be a second read falling in the same extent within 20 seconds of the first.

In all cases, the 4 HDDs were together about 700% busy. But with the new setting we get longer requests, most often full extents, which gives a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. The HDDs still have about 1.7 concurrent operations pending each, but the batch size drops, and the throughput drops with it.

Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.

BSBM Note

We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed; Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there: neither the queries nor the workload have any exploitable access pattern. Therefore, if an I/O component is to be included in a BSBM metric, the only way to score well is to use speculative reading to the maximum.

Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.

A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Speculatively reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would also favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy: just speculatively read everything.

Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.

Benchmarks, Redux Series