In the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. This is why the SSD extent read-ahead was incomparably better. We note that in the experiment, every page in the general area of the database that the experiment touched would in time be touched, and that the whole working set would end up in memory. Since no speculative read would be wasted, it stands to reason to read whole extents.

So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read-ahead would require more temporal locality before kicking in.
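
As an illustration, here is a minimal sketch in C of such a two-mode trigger policy. The types, field names, and window constants are assumptions made for the example, not Virtuoso's actual code.

#include <stdbool.h>
#include <time.h>

typedef struct
{
  long n_buffers;      /* total buffers in the pool */
  long n_in_use;       /* buffers currently holding pages */
} buf_pool_t;

typedef struct
{
  time_t last_touch;   /* last time a page of this extent was read */
  int    recent_reads; /* reads seen inside the current window */
} extent_hist_t;

/* Very long window while the pool is filling, a short one once it is full. */
#define FILL_WINDOW_SEC  600
#define WARM_WINDOW_SEC   10
#define WARM_MIN_READS     4

bool
extent_read_ahead_triggered (const buf_pool_t *bp, const extent_hist_t *eh,
                             time_t now)
{
  if (bp->n_in_use < bp->n_buffers)
    /* Buffer pool not yet full: trigger on almost any repeat visit. */
    return now - eh->last_touch <= FILL_WINDOW_SEC;

  /* Pool full: require more temporal locality before speculating. */
  return now - eh->last_touch <= WARM_WINDOW_SEC
    && eh->recent_reads >= WARM_MIN_READS;
}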

Still, the scheme was not really good, since the rest of the extent would go for background read while the triggering read was done right then, leading to extra seeks. This is good for latency but bad for throughput. So I changed this too, going to an "elevator only" scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality, with its attendant 10+ ms penalty.
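
To make the dispatch rule concrete, here is a small sketch; page_read_t, read_batch_t, and the helper functions are stand-ins for the server's buffer manager, not its real API.

typedef struct
{
  long page_no;        /* logical page number */
  long byte_offset;    /* offset of the page in the database file */
} page_read_t;

typedef struct
{
  page_read_t reqs[4096];
  int n_reqs;
} read_batch_t;

/* Stand-ins: issue one read immediately / append to the pending batch,
   which an I/O thread later services in file-offset order. */
void read_in_place (const page_read_t *r) { (void) r; }
void batch_add (read_batch_t *b, const page_read_t *r)
{
  if (b->n_reqs < 4096)
    b->reqs[b->n_reqs++] = *r;
}

void
dispatch_read (read_batch_t *batch, const page_read_t *req,
               int triggers_read_ahead)
{
  if (triggers_read_ahead)
    /* The triggering read rides with the read-ahead batch and is served
       in file-offset order: better throughput, worse latency. */
    batch_add (batch, req);
  else
    /* No read-ahead: serve it right away, favoring latency at the cost
       of breaking sequentiality. */
    read_in_place (req);
}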

We keep in mind that the test we target is BSBM warm-up time, which is purely a throughput business. One could have timeouts and penalize queries that sacrifice too much latency for throughput.

We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run.

We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since this way even a single thread's random I/Os get spread over multiple disks. The benefit is less in multiuser situations where the disks are randomly busy anyhow.

In the previous I/O experiments, we saw that with vectored read-ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset.
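
For reference, servicing a batch in file-offset order is essentially a sort followed by a sweep. The sketch below uses illustrative names (io_req_t, issue_read) rather than the real I/O layer.

#include <stdio.h>
#include <stdlib.h>

typedef struct { long byte_offset; long page_no; } io_req_t;

/* Order pending reads by file offset. */
static int
cmp_offset (const void *a, const void *b)
{
  long oa = ((const io_req_t *) a)->byte_offset;
  long ob = ((const io_req_t *) b)->byte_offset;
  return (oa > ob) - (oa < ob);
}

/* Stand-in for the low-level read; just logs the order of service. */
static void
issue_read (const io_req_t *r)
{
  printf ("read page %ld at offset %ld\n", r->page_no, r->byte_offset);
}

void
service_batch (io_req_t *reqs, int n)
{
  /* One sweep across the file instead of seeking back and forth. */
  qsort (reqs, (size_t) n, sizeof (io_req_t), cmp_offset);
  for (int i = 0; i < n; i++)
    issue_read (&reqs[i]);
}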

In Virtuoso 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the cache quite fast but was useless after the cache was full. The extent read-ahead first implemented in 6 was less aggressive, but it would continue working with a full cache and did in fact help with shifts in the working set.

The next logical step is to combine the vectored and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to an extent within the time window, we just add all the uncached allocated pages of the extent to the batch.
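
A minimal sketch of this combination follows, assuming an illustrative extent size of 256 pages and a 60-second window; the helper functions are stand-ins for the cache and the pending-I/O batch.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define EXTENT_PAGES       256   /* assumed pages per extent */
#define EXTENT_WINDOW_SEC   60   /* assumed temporal-locality window */

typedef struct
{
  long   first_page;                      /* first page number of the extent */
  time_t last_touch;                      /* last visit to the extent */
  unsigned char allocated[EXTENT_PAGES];  /* 1 = page is allocated */
} extent_t;

/* Stand-ins for the buffer manager; the real ones would consult the
   cache and the pending-I/O batch, which also drops duplicates. */
static bool page_cached (long page_no) { (void) page_no; return false; }
static void batch_add_page (long page_no) { printf ("queue %ld\n", page_no); }

void
extend_batch_with_extents (extent_t **extents, int n_extents, time_t now)
{
  /* 'extents' holds the distinct extents of the pages the vectored
     read-ahead is about to fetch. */
  for (int i = 0; i < n_extents; i++)
    {
      extent_t *ext = extents[i];
      if (now - ext->last_touch <= EXTENT_WINDOW_SEC)
        {
          /* Seen recently: add every uncached, allocated page of the
             extent to the same read-ahead batch. */
          for (int p = 0; p < EXTENT_PAGES; p++)
            if (ext->allocated[p] && !page_cached (ext->first_page + p))
              batch_add_page (ext->first_page + p);
        }
      ext->last_touch = now;
    }
}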

With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5,000 to 20,000 pages. The SSD start-up time drops to about 120 seconds from cold start to reaching 1200% CPU. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates of around 14 MB/s per drive, mostly reading chunks of 71 8K pages on average.

The BSBM workload does not offer better possibilities for optimization, short of pre-reading the whole database, which is not practical at large scales.

Some Details

First we start from cold disks, with and without a mandatory read of the whole extent when it is touched.

Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes:

 0: 151560.82 ms, total: 151718 ms
 1: 179589.08 ms, total: 179648 ms
 2:  71974.49 ms, total:  72017 ms
 3: 102701.73 ms, total: 102729 ms
 4:  58834.41 ms, total:  58856 ms
 5:  65926.34 ms, total:  65944 ms
 6:  68244.69 ms, total:  68274 ms
 7:  39197.15 ms, total:  39215 ms
 8:  45654.93 ms, total:  45674 ms
 9:  34850.30 ms, total:  34878 ms
10: 100061.30 ms, total: 100079 ms

The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read time was 16 ms.

With vectored read-ahead and full extents only, i.e., max speculation:

 0: 178854.23 ms, total: 179034 ms
 1: 110826.68 ms, total: 110887 ms
 2:  19896.11 ms, total:  19941 ms
 3:  36724.43 ms, total:  36753 ms
 4:  21253.70 ms, total:  21285 ms
 5:  18417.73 ms, total:  18439 ms
 6:  21668.92 ms, total:  21690 ms
 7:  12236.49 ms, total:  12267 ms
 8:  14922.74 ms, total:  14945 ms
 9:  11502.96 ms, total:  11523 ms
10:  15762.34 ms, total:  15792 ms
...

90:   1747.62 ms, total:   1761 ms
91:   1701.01 ms, total:   1714 ms
92:   1300.62 ms, total:   1318 ms
93:   1873.15 ms, total:   1886 ms
94:   1508.24 ms, total:   1524 ms
95:   1748.15 ms, total:   1761 ms
96:   2076.92 ms, total:   2090 ms
97:   2199.38 ms, total:   2212 ms
98:   2305.75 ms, total:   2319 ms
99:   1771.91 ms, total:   1784 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  1.3006s / 178.8542s
Elapsed runtime:           872.993 seconds
QMpH:                      412.374 query mixes per hour

The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%.

We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better.

Then the same with cold SSDs. First with no speculation:

 0:   5177.68 ms, total:   5302 ms
 1:   2570.16 ms, total:   2614 ms
 2:   1353.06 ms, total:   1391 ms
 3:   1957.63 ms, total:   1978 ms
 4:   1371.13 ms, total:   1386 ms
 5:   1765.55 ms, total:   1781 ms
 6:   1658.23 ms, total:   1673 ms
 7:   1273.87 ms, total:   1289 ms
 8:   1355.19 ms, total:   1380 ms
 9:   1152.78 ms, total:   1167 ms
10:   1787.91 ms, total:   1802 ms
...

90:   1116.25 ms, total:   1128 ms
91:    989.50 ms, total:   1001 ms
92:    833.24 ms, total:    844 ms
93:   1137.83 ms, total:   1150 ms
94:    969.47 ms, total:    982 ms
95:   1138.04 ms, total:   1149 ms
96:   1155.98 ms, total:   1168 ms
97:   1178.15 ms, total:   1193 ms
98:   1120.18 ms, total:   1132 ms
99:   1013.16 ms, total:   1025 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.8201s / 5.1777s
Elapsed runtime:           127.555 seconds
QMpH:                      2822.321 query mixes per hour

The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%.

Now, SSDs with max speculation.

 0:  44670.34 ms, total:  44809 ms
 1:  18490.44 ms, total:  18548 ms
 2:   7306.12 ms, total:   7353 ms
 3:   9452.66 ms, total:   9485 ms
 4:   5648.56 ms, total:   5668 ms
 5:   5493.21 ms, total:   5511 ms
 6:   5951.48 ms, total:   5970 ms
 7:   3815.59 ms, total:   3834 ms
 8:   4560.71 ms, total:   4579 ms
 9:   3523.74 ms, total:   3543 ms
10:   4724.04 ms, total:   4741 ms
...

90:    673.53 ms, total:    685 ms
91:    534.62 ms, total:    545 ms
92:    730.81 ms, total:    742 ms
93:   1358.14 ms, total:   1370 ms
94:   1098.64 ms, total:   1110 ms
95:   1232.20 ms, total:   1243 ms
96:   1259.57 ms, total:   1273 ms
97:   1298.95 ms, total:   1310 ms
98:   1156.01 ms, total:   1166 ms
99:   1025.45 ms, total:   1034 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.4725s / 44.6703s
Elapsed runtime:           269.323 seconds
QMpH:                      1336.683 query mixes per hour

The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%.

The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merged reads; the differences were small. The max I/O was 353 MB/s, with an average of 173 MB/s; the average CPU was 113%.

We see that the start latency is quite a bit longer than without speculation, and the CPU % is lower due to the higher latency of individual I/Os. The I/O rate is fair. We would expect more throughput, however.

We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive single-page requests, does not make a lot of difference. The peak I/O is a bit higher; the overall throughput is a bit lower.
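
For the record, merging runs of consecutive 8K pages into single requests of up to 100 pages looks roughly as follows; this is a generic POSIX pread() sketch, not the code that was actually measured.

#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SZ    8192
#define MAX_MERGE   100

/* Read the pages listed in 'pages' (page numbers, sorted ascending),
   coalescing runs of consecutive pages into single pread() calls. */
void
read_pages_merged (int fd, const long *pages, int n)
{
  char *buf = malloc ((size_t) MAX_MERGE * PAGE_SZ);
  if (!buf)
    return;
  int i = 0;
  while (i < n)
    {
      int run = 1;
      while (i + run < n && run < MAX_MERGE
             && pages[i + run] == pages[i] + run)
        run++;
      /* One request covers 'run' consecutive 8K pages. */
      ssize_t rc = pread (fd, buf, (size_t) run * PAGE_SZ,
                          (off_t) pages[i] * PAGE_SZ);
      (void) rc;   /* error handling omitted in this sketch */
      i += run;
    }
  free (buf);
}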

We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K random 4 KB reads per second promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s of sequential read with cat file > /dev/null and two drives busy. With 4 drives busy, this does not get better. The best 30-second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with the cat to /dev/null figure. We will later test with 8 SSDs and better controllers.

Note that the average I/O and CPU figures are averages over 30-second measurement windows; thus for short-running tests, there is some error from the window during which the activity ended.

Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max-speculation read-ahead filled the cache in 17 minutes, at an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead, one which in fact never gets triggered with BSBM Explore in steady state. The vectored read-ahead is kept on, since by itself it does not read pages that are not needed. However, the vectored read-ahead does not run either, because the data that is accessed in larger batches is already in memory.

Thus there remains a trickle of, on average, 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving RDF literal strings, presumably on behalf of the DESCRIBE query in the mix. This query touches things no other query touches, and does so one subject at a time, in a way that can neither be anticipated nor optimized.

The Virtuoso 7 column store will deal with this better because it is more space-efficient overall. If we apply stream compression to literals, these will go in under half the space, while quads will go in maybe one quarter of the space. Thus 3000 Mt all in memory should be possible with 72 GB of RAM. 1000 Mt row-wise did fit in 72 GB of RAM, except for the random literals accessed by the DESCRIBE. This alone drops throughput to under a third of the memory-only throughput when using HDDs. SSDs, on the other hand, can largely neutralize this effect.

Conclusions

We have looked at the basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks, without the need for reconfiguration or complex I/O policies. Even with HDDs, we have been able to get a decent read rate under conditions of server warm-up or a shift of the working set.

More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns.

As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store simply must pre-read eagerly. This is not hard to do and can be quite useful.
