In Hoc Signo Vinces (part 20 of n): 100G and 1000G With Cluster; When is Cluster Worthwhile; Effects of I/O [ Virtuso Data Space Bot ]

In the introduction to the scale-out piece, I promised to address the matter of data-to-memory ratio and to talk about when scale-out makes sense. Here we will see that scale-out makes sense whenever data does not fit in memory on a single commodity server. The gains in processing power are immediate, even when going from one box to just two, with both systems holding all data in memory.

As an initial take on the issue, we run 100 GB and 1000 GB on the test system. 100 GB fits trivially in memory; 1000 GB does not, as the memory is 384 GB total, of which 360 GB may be used by the server processes.

We run 2 workloads on the 100 GB database, having pre-loaded the data in memory:

run power throughput composite
1 349,027.7 420,503.1 383,102.1
2 387,890.3 433,066.6 409,856.5

This is directly comparable to the 100 GB single-server results. Comparing the second runs, we see a 1.53x gain in power and a 1.8x gain in throughput from 2x the platform. This is fully in line with expectations for a workload that is not trivially parallel, as we have seen in the previous articles. The difference between the first and second runs at 100 GB comes, for both single-server and cluster, from the latency of allocating transient query memory. For an official run, where the weakest link is the first power test, this memory would simply have to be pre-allocated.
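As a rough cross-check on the table above, the composite column appears to be the geometric mean of the power and throughput columns, which is how the TPC-H composite metric is defined. The sketch below recomputes it from the figures in this article and also expresses the quoted 1.53x and 1.8x gains as a fraction of the 2x platform increase. The function names are illustrative only; they are not part of any Virtuoso or TPC tooling.

```python
import math

# (power, throughput, reported composite) copied from the 100 GB table above.
RUNS_100GB = {
    1: (349_027.7, 420_503.1, 383_102.1),
    2: (387_890.3, 433_066.6, 409_856.5),
}


def composite(power, throughput):
    """Geometric mean of power and throughput (TPC-H style composite)."""
    return math.sqrt(power * throughput)


def scaling_efficiency(gain, platform_factor=2.0):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return gain / platform_factor


def main():
    for run, (power, throughput, reported) in RUNS_100GB.items():
        computed = composite(power, throughput)
        print(f"run {run}: computed {computed:,.1f}, reported {reported:,.1f}")
    # Gains quoted in the text for the second runs, cluster vs. single server.
    print(f"power efficiency:      {scaling_efficiency(1.53):.1%}")
    print(f"throughput efficiency: {scaling_efficiency(1.8):.1%}")


if __name__ == "__main__":
    main()
```

By this measure, the cluster delivers a bit over three quarters of ideal linear scaling on the power test and nine tenths on the throughput test.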

We run 2 workloads on the 1000 GB database, starting from cold.

The result is:

run power throughput composite
1 136,744.5 147,374.6 141,960.1
2 199,652.0 125,161.1 158,078.0

The 1000 GB result is not competition grade on this platform; more memory would be needed. For actual applications, though, the numbers are still in the usable range.

The 1000 GB setup uses 4 SSDs for storage, one per server process. The server processes are each bound to their own physical CPU.

We look at the meters: 32M pages (8M per process) are in memory at any one time. Over the 2 benchmark executions there are a total of 494M disk reads. Total CPU time is 165,674 seconds, of which about 10% is system time, over 10,063 seconds of real time. Cumulative disk-read wait time is 130,177 s. This gives an average disk read throughput of 384 MB/s.

This is easily sustained by 4 SSDs; in practice, the maximum throughput we see for reading is 1 GB/s (256 MB/s per SSD). Newer SSDs would do maybe twice that. Using rotating media would not be an option.
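The 384 MB/s figure appears to be consistent with dividing the total per-index read volume reported in this article by the elapsed real time. The following sketch redoes that arithmetic and derives a few related averages from the same meters. The constant and function names are made up for illustration, and treating the 1 GB/s ceiling as 1,024 MB/s (4 x 256 MB/s) is an assumption.

```python
# Per-index read volume in MB over the two 1000 GB runs, as reported in this article.
INDEX_READ_MB = {
    "LINEITEM": 1_987_483,
    "ORDERS": 1_440_526,
    "PARTSUPP": 199_335,
    "PART": 161_717,
    "CUSTOMER": 43_276,
    "O_CK": 19_085,
    "SUPPLIER": 13_393,
}

ELAPSED_S = 10_063          # real time for both runs
CPU_S = 165_674             # total CPU time
DISK_WAIT_S = 130_177       # cumulative disk-read wait time
DISK_READS = 494e6          # total number of disk reads
SSD_CEILING_MB_S = 4 * 256  # assumed aggregate read ceiling of the 4 SSDs


def read_throughput_mb_s():
    """Average read throughput: total MB read divided by real time."""
    return sum(INDEX_READ_MB.values()) / ELAPSED_S


def average_busy_cpus():
    """CPU seconds per real second, i.e. hardware threads busy on average."""
    return CPU_S / ELAPSED_S


def average_threads_waiting():
    """Disk-wait seconds per real second, i.e. threads blocked on reads on average."""
    return DISK_WAIT_S / ELAPSED_S


def average_wait_per_read_ms():
    """Mean wait per disk read in milliseconds."""
    return DISK_WAIT_S / DISK_READS * 1000


def ssd_utilization():
    """Average read throughput as a fraction of the assumed SSD ceiling."""
    return read_throughput_mb_s() / SSD_CEILING_MB_S


if __name__ == "__main__":
    print(f"read throughput:     {read_throughput_mb_s():.0f} MB/s")
    print(f"busy CPUs (avg):     {average_busy_cpus():.1f}")
    print(f"threads waiting:     {average_threads_waiting():.1f}")
    print(f"wait per read:       {average_wait_per_read_ms():.2f} ms")
    print(f"share of SSD limit:  {ssd_utilization():.0%}")
```

These are averages over both runs; they say nothing about peaks, which is where the SSDs are more likely to become the limiting factor.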

Without the drop in CPU utilization caused by waiting for the SSDs, we would have numbers very close to the 100 GB numbers.

The interconnect traffic for the two runs was 1,077 GB with no message compression. The write block time was 448 seconds of thread-time. So we see that blocking on write hurts platform utilization when running under optimal conditions, but compared to going to secondary storage, it is not a large factor.
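To see why write blocking is a minor factor here, one can compare it with the disk-read wait quoted earlier in this article. The sketch below does so, assuming that both figures are cumulative thread-time and that 1 GB of interconnect traffic means 1,000 MB; neither assumption is confirmed by the measurements themselves.

```python
ELAPSED_S = 10_063        # real time for both 1000 GB runs
INTERCONNECT_GB = 1_077   # total interconnect traffic, uncompressed
WRITE_BLOCK_S = 448       # thread-time spent blocked on interconnect writes
DISK_WAIT_S = 130_177     # cumulative disk-read wait time


def interconnect_mb_s(mb_per_gb=1000):
    """Average interconnect traffic in MB/s (decimal GB is an assumption)."""
    return INTERCONNECT_GB * mb_per_gb / ELAPSED_S


def write_block_vs_disk_wait():
    """Write-block time as a fraction of disk-read wait time."""
    return WRITE_BLOCK_S / DISK_WAIT_S


if __name__ == "__main__":
    print(f"interconnect average: {interconnect_mb_s():.0f} MB/s")
    print(f"write block / disk wait: {write_block_vs_disk_wait():.2%}")
```

Under these assumptions, time blocked on interconnect writes is well under one percent of the time spent waiting for disk reads.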

The 1000 GB scale has a transient peak memory consumption of 42 GB. This consists of hash-join build sides and GROUP BYs. The greatest memory consumers are Q9 with 9 GB, Q13 with 11 GB, and Q16 with 7 GB. Having many of these running at a time drives up the transient peak. The peak gets higher as the scale grows, also because a larger scale requires more concurrent query streams. At a ratio of 384 GB of memory to 1000 GB of data, we do not yet get into memory-saving plans like multi-pass hash joins or index use instead of hash. When the data size grows, replicated hash build sides will become less convenient, and communication will increase. Q9 and Q13 can be done by index with almost no transient memory, but these plans are easily 3x less CPU-efficient. They will probably help at 3000 GB and be necessary at least part of the time at 10,000 GB.

The I/O volume in MB per index over the 2 executions is:

index MB
LINEITEM 1,987,483
ORDERS 1,440,526
PARTSUPP 199,335
PART 161,717
CUSTOMER 43,276
O_CK 19,085
SUPPLIER 13,393

Of this, maybe 600 GB could be saved by stream-compressing o_comment. Otherwise this cannot be helped without adding memory. The lineitem reads are mostly for l_extendedprice, which is not compressible. If compressing o_comment made l_extendedprice always fit in memory, there would be a radical drop in I/O.

Also, the least-recently-used buffer management policy is at its very worst for big scans, specifically those of l_extendedprice: if the head is replaced while reading the tail, and the next read starts from the head, then the whole table or column is read all over again. Caching policies that specifically recognize scans of this sort could further reduce I/O. Clustering lineitems and orders on date, as Actian Vector TPC-H implementations do, also starts yielding a greater gain when not running from memory: one column (e.g., l_shipdate) may be scanned for the whole table but, if the matches are bunched together, most of l_extendedprice will not be read at all. Still, if going for top ranks in the races, all will be from memory, or at least there will be SSDs with read throughput around 150 MB/s per core, so these tricks become relatively less important.
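The LRU behavior described above is easy to reproduce in a toy simulation: when a scan is repeated over a column slightly larger than the buffer pool, LRU always evicts exactly the page that will be needed soonest. The sketch below counts buffer hits for repeated full scans under LRU and, for contrast, under a most-recently-used (MRU) eviction rule. MRU is only a textbook stand-in for a scan-aware policy, not what Virtuoso does, and the page counts are arbitrary.

```python
from collections import OrderedDict


def count_hits(capacity, accesses, evict_most_recent=False):
    """Count buffer hits for a page access sequence.

    The OrderedDict keeps pages ordered from least to most recently used.
    With evict_most_recent=False this is plain LRU; with True it is MRU.
    """
    cache = OrderedDict()
    hits = 0
    for page in accesses:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=evict_most_recent)
            cache[page] = True
    return hits


def repeated_scan(num_pages, passes):
    """Page access sequence for several head-to-tail scans of one column."""
    return [page for _ in range(passes) for page in range(num_pages)]


if __name__ == "__main__":
    accesses = repeated_scan(num_pages=100, passes=4)
    # Column of 100 pages, buffer pool of 80 pages.
    print("LRU hits:", count_hits(80, accesses))
    print("MRU hits:", count_hits(80, accesses, evict_most_recent=True))
    # Once the whole column fits, only the first pass misses.
    print("LRU hits, column fits:", count_hits(100, accesses))
```

With the column only 25% larger than the pool, LRU gets no hits at all, which matches the observation that the whole column is read all over again on every scan.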

In the 100 GB numerical quantities summaries, we see much the same picture as in the single-server case. Queries get faster, but their relative times are not radically different. The throughput test (many queries at a time) times are more or less multiples of the power (single user) times. This picture breaks at 1000 GB, where I/O both drops performance to under half and introduces huge variation in the execution times of any single query. The time depends entirely on which queries are running along with or right before the execution and on whether these have the same or different working sets. All the streams have the same queries with different parameters, but the query order in each stream is different.

The numerical quantities follow for all the runs. Note that the first 1000 GB run is cold. A competition-grade 1000 GB result can be made with double the memory, and the more CPU the better. We will try one at Amazon in a bit.
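To put a number on that variation, the sketch below compares the slowest and fastest throughput-stream timings of one query at each scale, using the Q5 timings from the first run at 100 GB and at 1000 GB as listed in the numerical quantities summaries in this article. The max/min ratio is just one simple spread measure chosen for illustration.

```python
# Q5 timings in seconds for the throughput streams of run 1, copied from
# the numerical quantities summaries (stream 0, the power test, is excluded).
Q5_100GB = [3.944699, 3.942459, 5.359990, 5.316591, 5.705877]
Q5_1000GB = [60.699333, 310.273986, 64.445611, 59.746855,
             42.372449, 49.831932, 714.284222]


def spread(timings):
    """Ratio of the slowest to the fastest execution of the same query."""
    return max(timings) / min(timings)


if __name__ == "__main__":
    print(f"Q5 max/min at 100 GB:  {spread(Q5_100GB):.1f}x")
    print(f"Q5 max/min at 1000 GB: {spread(Q5_1000GB):.1f}x")
```

In memory, the slowest Q5 execution takes less than twice as long as the fastest; with I/O in play, the gap grows to more than an order of magnitude.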

***

The conclusion is that scale-out pays from the get-go. At present prices, a system with twice the power of a single node of the test system is cost effective. Scales of up to 500 GB fit on a single commodity server costing under $10K. Rather than going from a mid-to-large dual-socket box to a quad-socket box, one is likely to be better off with two cheaper dual-socket boxes. These are also readily available on clouds, whereas scale-up configurations are not. From 1 TB onwards, a cluster is expected to clearly win. At 3 TB, a commodity cluster will clearly be the better deal for both price and absolute performance.

100 GB Run 1

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 100
Total Data Storage/Database Size 0M
Query Streams for Throughput Test 5
Virt-H Power 349,027.7
Virt-H Throughput 420,503.1
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 383,102.1
Measurement Interval in Throughput Test (Ts) 94.273000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 15:05:07 10/03/2014 15:05:40 0:00:33
Stream 1 10/03/2014 15:05:42 10/03/2014 15:07:15 0:01:33
Stream 2 10/03/2014 15:05:42 10/03/2014 15:07:15 0:01:33
Stream 3 10/03/2014 15:05:42 10/03/2014 15:07:16 0:01:34
Stream 4 10/03/2014 15:05:42 10/03/2014 15:07:14 0:01:32
Stream 5 10/03/2014 15:05:42 10/03/2014 15:07:15 0:01:33
Refresh 0 10/03/2014 15:05:07 10/03/2014 15:05:13 0:00:06
10/03/2014 15:05:41 10/03/2014 15:05:42 0:00:01
Refresh 1 10/03/2014 15:06:48 10/03/2014 15:07:03 0:00:15
Refresh 2 10/03/2014 15:05:42 10/03/2014 15:06:06 0:00:24
Refresh 3 10/03/2014 15:06:06 10/03/2014 15:06:20 0:00:14
Refresh 4 10/03/2014 15:06:20 10/03/2014 15:06:35 0:00:15
Refresh 5 10/03/2014 15:06:35 10/03/2014 15:06:48 0:00:13

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.045198 0.337315 1.129548 0.327029 1.230955 0.473090 0.979096 0.852639
Stream 1 4.521951 0.596538 3.464342 1.167101 3.944699 1.744325 5.442328 4.706185
Stream 2 4.678728 0.837205 3.594060 1.911751 3.942459 0.947788 3.821267 4.686319
Stream 3 5.126384 0.932394 0.961762 1.043759 5.359990 1.035597 3.056079 5.803445
Stream 4 4.497118 0.381036 4.665412 1.224975 5.316591 1.666253 2.297872 6.425171
Stream 5 4.080968 0.493741 4.416305 0.879202 5.705877 1.615987 3.846881 3.346686
Min Qi 4.080968 0.381036 0.961762 0.879202 3.942459 0.947788 2.297872 3.346686
Max Qi 5.126384 0.932394 4.665412 1.911751 5.705877 1.744325 5.442328 6.425171
Avg Qi 4.581030 0.648183 3.420376 1.245358 4.853923 1.401990 3.692885 4.993561
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 3.575916 2.786656 1.579488 0.611454 3.132460 0.685095 0.955559 1.060110
Stream 1 9.551437 7.187181 5.816455 2.004946 9.461347 5.624020 5.517677 2.924265
Stream 2 9.637427 6.641804 6.359532 2.412576 8.819754 3.335494 4.549792 3.163920
Stream 3 11.041451 6.464479 6.982671 3.272975 8.342983 3.448635 4.405911 2.886393
Stream 4 8.860228 6.754529 7.065501 3.225236 8.789565 3.419165 4.240718 2.399092
Stream 5 7.339672 8.121027 6.261988 2.711946 8.764934 3.106366 6.544712 3.472092
Min Qi 7.339672 6.464479 5.816455 2.004946 8.342983 3.106366 4.240718 2.399092
Max Qi 11.041451 8.121027 7.065501 3.272975 9.461347 5.624020 6.544712 3.472092
Avg Qi 9.286043 7.033804 6.497229 2.725536 8.835717 3.786736 5.051762 2.969152
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.433789 0.972152 0.780247 1.287222 1.360084 0.254051 6.201742 1.219707
Stream 1 3.398354 2.591249 3.021207 4.663204 4.775704 1.116547 8.770115 5.643550
Stream 2 6.811520 3.411846 2.634076 4.296810 4.669635 2.282003 18.039617 6.060465
Stream 3 4.947110 2.479268 2.952951 6.431644 5.469152 1.816467 8.271266 5.498956
Stream 4 5.240237 2.062261 2.734378 6.055141 2.997684 2.519301 7.889700 6.944722
Stream 5 4.839670 3.379315 3.231582 6.255944 3.759509 1.347830 8.707303 4.376033
Min Qi 3.398354 2.062261 2.634076 4.296810 2.997684 1.116547 7.889700 4.376033
Max Qi 6.811520 3.411846 3.231582 6.431644 5.469152 2.519301 18.039617 6.944722
Avg Qi 5.047378 2.784788 2.914839 5.540549 4.334337 1.816430 10.335600 5.704745

100 GB Run 2

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 100
Total Data Storage/Database Size 0M
Query Streams for Throughput Test 5
Virt-H Power 387,890.3
Virt-H Throughput 433,066.6
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 409,856.5
Measurement Interval in Throughput Test (Ts) 91.541000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 15:07:19 10/03/2014 15:07:47 0:00:28
Stream 1 10/03/2014 15:07:48 10/03/2014 15:09:19 0:01:31
Stream 2 10/03/2014 15:07:48 10/03/2014 15:09:16 0:01:28
Stream 3 10/03/2014 15:07:48 10/03/2014 15:09:17 0:01:29
Stream 4 10/03/2014 15:07:48 10/03/2014 15:09:16 0:01:28
Stream 5 10/03/2014 15:07:48 10/03/2014 15:09:20 0:01:32
Refresh 0 10/03/2014 15:07:19 10/03/2014 15:07:22 0:00:03
10/03/2014 15:07:47 10/03/2014 15:07:48 0:00:01
Refresh 1 10/03/2014 15:08:45 10/03/2014 15:08:59 0:00:14
Refresh 2 10/03/2014 15:07:49 10/03/2014 15:08:02 0:00:13
Refresh 3 10/03/2014 15:08:02 10/03/2014 15:08:17 0:00:15
Refresh 4 10/03/2014 15:08:17 10/03/2014 15:08:29 0:00:12
Refresh 5 10/03/2014 15:08:29 10/03/2014 15:08:45 0:00:16

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.081986 0.208487 0.902462 0.313160 1.312273 0.493157 0.926629 0.786345
Stream 1 2.755427 0.911578 3.618085 0.664407 3.740112 2.118189 4.738754 6.551446
Stream 2 4.189612 0.957921 5.267355 2.152479 6.068005 1.263380 4.251842 3.620160
Stream 3 4.708834 0.981651 2.411839 0.790955 4.384516 1.322670 2.641571 4.771831
Stream 4 3.739567 1.185884 2.863871 1.517891 5.946967 1.179960 3.840560 4.926325
Stream 5 5.258746 0.705228 3.460904 0.951328 4.530620 1.104500 3.226494 4.041142
Min Qi 2.755427 0.705228 2.411839 0.664407 3.740112 1.104500 2.641571 3.620160
Max Qi 5.258746 1.185884 5.267355 2.152479 6.068005 2.118189 4.738754 6.551446
Avg Qi 4.130437 0.948452 3.524411 1.215412 4.934044 1.397740 3.739844 4.782181
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 3.226685 1.878227 1.802562 0.676499 3.145884 0.653129 0.963449 0.990524
Stream 1 8.842030 5.630466 5.728147 2.643227 9.615551 3.197855 4.676538 4.285251
Stream 2 9.508612 5.288044 4.319998 1.492915 9.431995 3.206360 3.859749 3.201996
Stream 3 10.480224 5.880274 4.517320 2.509405 6.913159 2.892479 6.408602 2.938061
Stream 4 8.824111 5.752413 5.997959 2.581237 8.954756 3.351951 2.420598 4.148455
Stream 5 4.905553 7.099111 5.121041 2.516020 9.354924 3.955638 4.389209 3.818902
Min Qi 4.905553 5.288044 4.319998 1.492915 6.913159 2.892479 2.420598 2.938061
Max Qi 10.480224 7.099111 5.997959 2.643227 9.615551 3.955638 6.408602 4.285251
Avg Qi 8.512106 5.930062 5.136893 2.348561 8.854077 3.320857 4.350939 3.678533
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.405338 0.868313 0.806277 1.123366 1.314028 0.233214 2.590459 1.230242
Stream 1 5.191045 3.171244 3.403836 4.604523 3.721133 0.892096 7.136841 6.500452
Stream 2 6.282687 2.845465 3.024786 4.086546 3.530743 0.619683 9.263671 4.826173
Stream 3 6.040787 2.659766 2.787273 6.210077 3.902190 2.175417 7.974860 6.689780
Stream 4 4.978721 2.542674 3.518783 4.385571 3.906211 0.918752 6.303352 5.139326
Stream 5 5.208600 3.761975 3.682886 7.874493 5.017600 2.087150 7.999074 7.978154
Min Qi 4.978721 2.542674 2.787273 4.086546 3.530743 0.619683 6.303352 4.826173
Max Qi 6.282687 3.761975 3.682886 7.874493 5.017600 2.175417 9.263671 7.978154
Avg Qi 5.540368 2.996225 3.283513 5.432242 4.015575 1.338620 7.735560 6.226777

1000 GB Run 1

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 1000
Total Data Storage/Database Size 26M
Query Streams for Throughput Test 7
Virt-H Power 136,744.5
Virt-H Throughput 147,374.6
Virt-H Composite Query-per-Hour Metric (Qph@1000GB) 141,960.1
Measurement Interval in Throughput Test (Ts) 3,761.953000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 09:18:42 10/03/2014 09:34:12 0:15:30
Stream 1 10/03/2014 09:34:43 10/03/2014 10:35:42 1:00:59
Stream 2 10/03/2014 09:34:43 10/03/2014 10:37:14 1:02:31
Stream 3 10/03/2014 09:34:43 10/03/2014 10:37:25 1:02:42
Stream 4 10/03/2014 09:34:43 10/03/2014 10:33:31 0:58:48
Stream 5 10/03/2014 09:34:43 10/03/2014 10:35:26 1:00:43
Stream 6 10/03/2014 09:34:43 10/03/2014 10:28:00 0:53:17
Stream 7 10/03/2014 09:34:43 10/03/2014 10:35:42 1:00:59
Refresh 0 10/03/2014 09:18:42 10/03/2014 09:19:27 0:00:45
10/03/2014 09:34:12 10/03/2014 09:34:42 0:00:30
Refresh 1 10/03/2014 09:43:03 10/03/2014 09:43:38 0:00:35
Refresh 2 10/03/2014 09:34:43 10/03/2014 09:36:54 0:02:11
Refresh 3 10/03/2014 09:36:53 10/03/2014 09:38:39 0:01:46
Refresh 4 10/03/2014 09:38:39 10/03/2014 09:39:22 0:00:43
Refresh 5 10/03/2014 09:39:23 10/03/2014 09:41:09 0:01:46
Refresh 6 10/03/2014 09:41:09 10/03/2014 09:42:15 0:01:06
Refresh 7 10/03/2014 09:42:15 10/03/2014 09:43:02 0:00:47

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 104.488583 18.351559 24.631282 36.195531 36.319915 3.807790 22.750889 31.190630
Stream 1 209.323441 26.205435 59.637373 245.808484 60.699333 22.369379 289.435780 335.733425
Stream 2 109.134446 64.185831 96.131735 108.459418 310.273986 53.595127 152.242755 104.350098
Stream 3 73.321611 215.535408 69.543101 12.423757 64.445611 38.254747 122.952872 98.713213
Stream 4 110.875875 4.272757 78.697314 16.316807 59.746855 23.447211 353.190412 342.549908
Stream 5 41.972337 5.978707 60.784575 34.219229 42.372449 344.590640 146.186614 274.972270
Stream 6 115.760155 18.692078 58.493147 9.193234 49.831932 19.081395 60.603109 128.095501
Stream 7 58.601744 118.126585 297.327543 298.578268 714.284222 108.475250 91.868151 55.881029
Min Qi 41.972337 4.272757 58.493147 9.193234 42.372449 19.081395 60.603109 55.881029
Max Qi 209.323441 215.535408 297.327543 298.578268 714.284222 344.590640 353.190412 342.549908
Avg Qi 102.712801 64.713829 102.944970 103.571314 185.950627 87.116250 173.782813 191.470778
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 41.777880 10.035063 16.125611 9.245638 209.443782 111.271310 37.821595 9.483838
Stream 1 244.243830 63.473338 207.741931 33.696956 561.057408 141.026049 126.818051 54.774792
Stream 2 189.297446 144.853756 56.292537 184.781273 501.330052 49.965102 107.736393 85.691079
Stream 3 231.060699 355.394713 43.483645 11.806590 555.445111 36.722686 251.241817 9.057850
Stream 4 227.371508 32.207115 108.880658 139.922550 532.697956 57.106583 159.198489 153.088913
Stream 5 416.113856 108.689389 62.847727 702.712683 622.906487 58.198961 89.707091 85.614769
Stream 6 228.019243 62.474213 88.227994 282.932978 432.387869 238.544027 61.486269 56.950548
Stream 7 230.564416 69.197517 130.708759 120.531103 551.112816 57.438478 82.256530 63.796403
Min Qi 189.297446 32.207115 43.483645 11.806590 432.387869 36.722686 61.486269 9.057850
Max Qi 416.113856 355.394713 207.741931 702.712683 622.906487 238.544027 251.241817 153.088913
Avg Qi 252.381571 119.470006 99.740464 210.912019 536.705386 91.285984 125.492091 72.710622
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 22.897349 47.870269 12.735580 25.982194 46.091766 6.623306 45.120559 30.016788
Stream 1 123.444839 22.212194 647.523826 97.431531 81.592165 4.573040 21.068225 14.486185
Stream 2 80.853865 622.651044 288.656211 336.409076 70.925079 33.578052 82.910543 48.001583
Stream 3 392.340812 84.967695 57.181935 473.720060 497.262620 66.966740 54.778284 50.940094
Stream 4 97.069440 301.705125 338.035788 258.992426 103.699408 28.750257 23.858757 13.626079
Stream 5 69.882110 34.277914 146.031938 179.656129 104.788154 10.836148 54.319823 52.077352
Stream 6 141.310431 247.242904 94.392791 702.775460 80.142930 19.969889 46.027410 19.136271
Stream 7 89.018281 51.105998 281.234432 79.046122 84.341517 26.221892 33.169666 13.309634
Min Qi 69.882110 22.212194 57.181935 79.046122 70.925079 4.573040 21.068225 13.309634
Max Qi 392.340812 622.651044 647.523826 702.775460 497.262620 66.966740 82.910543 52.077352
Avg Qi 141.988540 194.880411 264.722417 304.004401 146.107410 27.270860 45.161815 30.225314

1000 GB Run 2

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 1000
Total Data Storage/Database Size 26M
Query Streams for Throughput Test 7
Virt-H Power 199,652.0
Virt-H Throughput 125,161.1
Virt-H Composite Query-per-Hour Metric (Qph@1000GB) 158,078.0
Measurement Interval in Throughput Test (Ts) 4,429.608000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 10:37:29 10/03/2014 10:52:26 0:14:57
Stream 1 10/03/2014 10:52:35 10/03/2014 12:05:19 1:12:44
Stream 2 10/03/2014 10:52:35 10/03/2014 12:06:25 1:13:50
Stream 3 10/03/2014 10:52:35 10/03/2014 12:03:08 1:10:33
Stream 4 10/03/2014 10:52:35 10/03/2014 12:05:20 1:12:45
Stream 5 10/03/2014 10:52:35 10/03/2014 11:57:40 1:05:05
Stream 6 10/03/2014 10:52:35 10/03/2014 12:05:28 1:12:53
Stream 7 10/03/2014 10:52:35 10/03/2014 12:05:25 1:12:50
Refresh 0 10/03/2014 10:37:29 10/03/2014 10:37:52 0:00:23
10/03/2014 10:52:25 10/03/2014 10:52:34 0:00:09
Refresh 1 10/03/2014 11:01:44 10/03/2014 11:02:29 0:00:45
Refresh 2 10/03/2014 10:52:35 10/03/2014 10:54:50 0:02:15
Refresh 3 10/03/2014 10:54:50 10/03/2014 10:57:02 0:02:12
Refresh 4 10/03/2014 10:57:05 10/03/2014 10:58:47 0:01:42
Refresh 5 10/03/2014 10:58:47 10/03/2014 10:59:46 0:00:59
Refresh 6 10/03/2014 10:59:45 10/03/2014 11:00:38 0:00:53
Refresh 7 10/03/2014 11:00:39 10/03/2014 11:01:44 0:01:05

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 34.105419 1.439089 9.802183 2.033956 10.525742 3.356152 23.953729 36.199533
Stream 1 26.598252 150.572833 41.930330 86.870320 50.604856 201.001372 61.638366 244.013359
Stream 2 50.129895 102.219282 12.380935 102.319615 62.577229 43.454392 891.076608 407.640626
Stream 3 269.947278 53.172724 54.649973 11.460062 66.695722 17.336698 63.371232 91.158050
Stream 4 41.149221 22.520836 28.707973 509.984321 68.916549 17.525025 702.191490 666.450230
Stream 5 59.179045 30.734442 99.504351 11.145990 101.334340 21.660836 74.625589 535.160207
Stream 6 225.105215 55.567328 46.749707 554.474507 215.657091 54.362551 72.960653 442.194302
Stream 7 220.993226 28.528230 47.543365 336.191006 308.931194 9.767397 850.258452 66.121298
Min Qi 26.598252 22.520836 12.380935 11.145990 50.604856 9.767397 61.638366 66.121298
Max Qi 269.947278 150.572833 99.504351 554.474507 308.931194 201.001372 891.076608 666.450230
Avg Qi 127.586019 63.330811 47.352376 230.349403 124.959569 52.158324 388.017484 350.391153
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 50.439615 9.287196 15.892947 7.112715 250.527755 131.478131 54.458992 10.525842
Stream 1 420.919329 317.402771 101.818338 403.213385 724.539887 160.669174 65.374584 28.563034
Stream 2 464.378760 210.938167 23.395678 545.086468 736.005716 54.680686 398.880053 34.018918
Stream 3 350.083270 321.781561 48.652019 435.954962 378.872739 100.588804 289.350342 190.140640
Stream 4 306.265994 249.621982 79.280220 221.255121 348.932746 49.555802 100.062439 61.368814
Stream 5 511.923087 133.018420 134.199065 9.655693 662.658830 104.380635 82.847242 59.952271
Stream 6 578.362701 61.221715 145.613349 47.957006 621.993889 256.150595 77.124777 91.163005
Stream 7 418.450091 391.818564 29.360218 17.236628 761.850888 31.952329 50.393082 27.530882
Min Qi 306.265994 61.221715 23.395678 9.655693 348.932746 31.952329 50.393082 27.530882
Max Qi 578.362701 391.818564 145.613349 545.086468 761.850888 256.150595 398.880053 190.140640
Avg Qi 435.769033 240.829026 80.331270 240.051323 604.979242 108.282575 152.004646 70.391081
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 22.444111 37.978532 13.347320 26.553364 115.511143 7.670304 22.771613 8.761026
Stream 1 329.153807 19.198590 258.455295 556.256015 99.647793 14.878746 32.803289 8.771923
Stream 2 76.940373 74.916489 75.246897 16.035355 14.403643 32.348500 91.981362 41.426540
Stream 3 88.918404 238.858707 221.257060 688.441713 247.669761 5.345632 70.780594 49.352955
Stream 4 497.105081 167.874781 67.668514 76.820831 78.585717 3.655421 73.165786 29.401670
Stream 5 309.991618 123.023557 380.801141 347.055909 93.478502 18.351491 33.338814 12.557542
Stream 6 57.200926 154.489850 386.007137 103.558355 32.676369 92.863316 35.576966 14.061801
Stream 7 160.332088 46.934177 340.957970 84.479720 78.985110 60.568796 44.362737 8.831746
Min Qi 57.200926 19.198590 67.668514 16.035355 14.403643 3.655421 32.803289 8.771923
Max Qi 497.105081 238.858707 386.007137 688.441713 247.669761 92.863316 91.981362 49.352955
Avg Qi 217.091757 117.899450 247.199145 267.521128 92.206699 32.573129 54.572793 23.486311

To be continued...

In Hoc Signo Vinces (TPC-H) Series

10/06/2014 13:53 GMT
In Hoc Signo Vinces (part 20 of n): 100G and 1000G With Cluster; When is Cluster Worthwhile; Effects of I/O [ Orri Erling ]

In the introduction to scale out piece, I promised to address the matter of data-to-memory ratio, and to talk about when scale-out makes sense. Here we will see that scale-out makes sense whenever data does not fit in memory on a single commodity server. The gains in processing power are immediate, even when going from one box to just two, with both systems having all in memory.

As an initial take on the issue we run 100 GB and 1000 GB on the test system. 100 GB is trivially in memory, 1000 GB is not, as the memory is 384 GB total, of which 360 GB may be used for the processes.

We run 2 workloads on the 100 GB database, having pre-loaded the data in memory:

run power throughput composite
1 349,027.7 420,503.1 383,102.1
2 387,890.3 433,066.6 409,856.5

This is directly comparable to the 100 GB single-server results. Comparing the second runs, we see a 1.53x gain in power and a 1.8x gain in throughput from 2x the platform. This is fully on the level for a workload that is not trivially parallel, as we have seen in the previous articles. The difference between the first and second runs at 100 GB comes, for both single-server and cluster, from the latency of allocating transient query memory. For an official run, where the weakest link is the first power test, this would simply have to be pre-allocated.

We run 2 workloads on the 1000 GB database, starting from cold.

The result is:

run power throughput composite
1 136,744.5 147,374.6 141,960.1
2 199,652.0 125,161.1 158,078.0

The 1000 GB result is not for competition with this platform; more memory would be needed. For actual applications, the numbers are still in the usable range, though.

The 1000 GB setup uses 4 SSDs for storage, one per server process. The server processes are each bound to their own physical CPU.

We look at the meters: 32M pages (8M per process) are in memory at each time. Over the 2 benchmark executions there are a total of 494M disk reads. The total CPU time is 165,674 seconds of CPU, of which about 10% are system, over 10,063 seconds of real-time. Cumulative disk-read wait-time is 130,177 s. This gives an average disk read throughput of 384 MB/s.

This is easily sustained by 4 SSDs; in practice, the maximum throughput we see for reading is 1 GB/s (256 MB/s per SSD). Newer SSDs would do maybe twice that. Using rotating media would not be an option.

Without the drop in CPU caused by waiting for SSD, we would have numbers very close to the 100 GB numbers.

The interconnect traffic for the two runs was 1,077 GB with no message compression. The write block time was 448 seconds of thread-time. So we see that blocking on write hurts platform utilization when running under optimal conditions, but compared to going to secondary storage, it is not a large factor.

The 1000 GB scale has a transient peak memory consumption of 42 GB. This consists of hash-join build sides and GROUP BYs. The greatest memory consumers are Q9 with 9 GB, Q13 with 11 GB, and Q16 with 7 GB. Having many of these at a time drives up the transient peak. The peak gets higher as the scale grows, also because a larger scale requires more concurrent query streams. At the 384 GB for 1000 GB ratio, we do not yet get into memory saving plans like hash joins in many passes or index use instead of hash. When the data size grows, replicated hash build sides will become less convenient, and communication will increase. Q9 and Q13 can be done by index with almost no transient memory, but these plans are easily 3x less efficient for CPU. These will probably help at 3000 GB and be necessary at least part of the time at 10,000 GB.

The I/O volume in MB per index over the 2 executions is:

index MB
LINEITEM 1,987,483
ORDERS 1,440,526
PARTSUPP 199,335
PART 161,717
CUSTOMER 43,276
O_CK 19,085
SUPPLIER 13,393

Of this, maybe 600 GB could be saved by stream compressing o_comment. Otherwise this cannot be helped without adding memory. The lineitem reads are mostly for l_extendedprice, which is not compressible. If compressing o_comment made l_extendedprice always fit in memory, then there would be a radical drop in I/O. Also, as a matter of fact, the buffer management policy of least-recently-used works the very worst for big scans, specifically those of l_extendedprice: If the head is replaced when reading the tail, and the next read starts from the head, then the whole table/column is read all over again. Caching policies that specially recognized scans of this sort could further reduce I/O. Clustering lineitems/orders on date, as Actian Vector TPC-H implementations do, also starts yielding a greater gain when not running from memory: One column (e.g., l_shipdate) may be scanned for the whole table but, if the matches are bunched together, then most of l_extendedprice will not be read at all. Still, if going for top ranks in the races, all will be from memory, or at least there will be SSDs with read throughput around 150 MB/s per core, so these tricks become relatively less important.

In the 100 GB numerical quantities summaries, we see much the same picture as in the single-server. Queries get faster, but their relative times are not radically different. The throughput test (many queries at a time) times are more or less multiples of the power (single user) times. This picture breaks at 1000 GB where I/O first drops the performance to under half and introduces huge variation in execution times within a single query. The time entirely depends on which queries are running along with or right before the execution and on whether these have the same or different working sets. All the streams have the same queries with different parameters, but the query order in each stream is different.

The numerical quantities follow for all the runs. Note that the first 1000 GB run is cold. A competition grade 1000 GB result can be made with double the memory, and the more CPU the better. We will try one at Amazon in a bit.

***

The conclusion is that scale-out pays from the get-go. At present prices, a system with twice the power of a single node of the test system is cost effective. Scales of up to 500 GB are single commodity server, under $10K. Rather than going from a mid-to-large dual-socket box to a quad-socket box, one is likely to be better off having two cheaper dual-socket boxes. These are also readily available on clouds, whereas scale-up configurations are not. Onwards of 1 TB, a cluster is expected to clearly win. At 3 TB, a commodity cluster will clearly be the better deal for both price and absolute performance.

100 GB Run 1

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 100
Total Data Storage/Database Size 0M
Query Streams for
Throughput Test
5
Virt-H Power 349,027.7
Virt-H Throughput 420,503.1
Virt-H Composite
Query-per-Hour Metric
(Qph@100GB)
383,102.1
Measurement Interval in
Throughput Test (Ts)
94.273000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 15:05:07 10/03/2014 15:05:40 0:00:33
Stream 1 10/03/2014 15:05:42 10/03/2014 15:07:15 0:01:33
Stream 2 10/03/2014 15:05:42 10/03/2014 15:07:15 0:01:33
Stream 3 10/03/2014 15:05:42 10/03/2014 15:07:16 0:01:34
Stream 4 10/03/2014 15:05:42 10/03/2014 15:07:14 0:01:32
Stream 5 10/03/2014 15:05:42 10/03/2014 15:07:15 0:01:33
Refresh 0 10/03/2014 15:05:07 10/03/2014 15:05:13 0:00:06
10/03/2014 15:05:41 10/03/2014 15:05:42 0:00:01
Refresh 1 10/03/2014 15:06:48 10/03/2014 15:07:03 0:00:15
Refresh 2 10/03/2014 15:05:42 10/03/2014 15:06:06 0:00:24
Refresh 3 10/03/2014 15:06:06 10/03/2014 15:06:20 0:00:14
Refresh 4 10/03/2014 15:06:20 10/03/2014 15:06:35 0:00:15
Refresh 5 10/03/2014 15:06:35 10/03/2014 15:06:48 0:00:13

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.045198 0.337315 1.129548 0.327029 1.230955 0.473090 0.979096 0.852639
Stream 1 4.521951 0.596538 3.464342 1.167101 3.944699 1.744325 5.442328 4.706185
Stream 2 4.678728 0.837205 3.594060 1.911751 3.942459 0.947788 3.821267 4.686319
Stream 3 5.126384 0.932394 0.961762 1.043759 5.359990 1.035597 3.056079 5.803445
Stream 4 4.497118 0.381036 4.665412 1.224975 5.316591 1.666253 2.297872 6.425171
Stream 5 4.080968 0.493741 4.416305 0.879202 5.705877 1.615987 3.846881 3.346686
Min Qi 4.080968 0.381036 0.961762 0.879202 3.942459 0.947788 2.297872 3.346686
Max Qi 5.126384 0.932394 4.665412 1.911751 5.705877 1.744325 5.442328 6.425171
Avg Qi 4.581030 0.648183 3.420376 1.245358 4.853923 1.401990 3.692885 4.993561
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 3.575916 2.786656 1.579488 0.611454 3.132460 0.685095 0.955559 1.060110
Stream 1 9.551437 7.187181 5.816455 2.004946 9.461347 5.624020 5.517677 2.924265
Stream 2 9.637427 6.641804 6.359532 2.412576 8.819754 3.335494 4.549792 3.163920
Stream 3 11.041451 6.464479 6.982671 3.272975 8.342983 3.448635 4.405911 2.886393
Stream 4 8.860228 6.754529 7.065501 3.225236 8.789565 3.419165 4.240718 2.399092
Stream 5 7.339672 8.121027 6.261988 2.711946 8.764934 3.106366 6.544712 3.472092
Min Qi 7.339672 6.464479 5.816455 2.004946 8.342983 3.106366 4.240718 2.399092
Max Qi 11.041451 8.121027 7.065501 3.272975 9.461347 5.624020 6.544712 3.472092
Avg Qi 9.286043 7.033804 6.497229 2.725536 8.835717 3.786736 5.051762 2.969152
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.433789 0.972152 0.780247 1.287222 1.360084 0.254051 6.201742 1.219707
Stream 1 3.398354 2.591249 3.021207 4.663204 4.775704 1.116547 8.770115 5.643550
Stream 2 6.811520 3.411846 2.634076 4.296810 4.669635 2.282003 18.039617 6.060465
Stream 3 4.947110 2.479268 2.952951 6.431644 5.469152 1.816467 8.271266 5.498956
Stream 4 5.240237 2.062261 2.734378 6.055141 2.997684 2.519301 7.889700 6.944722
Stream 5 4.839670 3.379315 3.231582 6.255944 3.759509 1.347830 8.707303 4.376033
Min Qi 3.398354 2.062261 2.634076 4.296810 2.997684 1.116547 7.889700 4.376033
Max Qi 6.811520 3.411846 3.231582 6.431644 5.469152 2.519301 18.039617 6.944722
Avg Qi 5.047378 2.784788 2.914839 5.540549 4.334337 1.816430 10.335600 5.704745

100 GB Run 2

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 100
Total Data Storage/Database Size 0M
Query Streams for Throughput Test 5
Virt-H Power 387,890.3
Virt-H Throughput 433,066.6
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 409,856.5
Measurement Interval in Throughput Test (Ts) 91.541000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 15:07:19 10/03/2014 15:07:47 0:00:28
Stream 1 10/03/2014 15:07:48 10/03/2014 15:09:19 0:01:31
Stream 2 10/03/2014 15:07:48 10/03/2014 15:09:16 0:01:28
Stream 3 10/03/2014 15:07:48 10/03/2014 15:09:17 0:01:29
Stream 4 10/03/2014 15:07:48 10/03/2014 15:09:16 0:01:28
Stream 5 10/03/2014 15:07:48 10/03/2014 15:09:20 0:01:32
Refresh 0 10/03/2014 15:07:19 10/03/2014 15:07:22 0:00:03
10/03/2014 15:07:47 10/03/2014 15:07:48 0:00:01
Refresh 1 10/03/2014 15:08:45 10/03/2014 15:08:59 0:00:14
Refresh 2 10/03/2014 15:07:49 10/03/2014 15:08:02 0:00:13
Refresh 3 10/03/2014 15:08:02 10/03/2014 15:08:17 0:00:15
Refresh 4 10/03/2014 15:08:17 10/03/2014 15:08:29 0:00:12
Refresh 5 10/03/2014 15:08:29 10/03/2014 15:08:45 0:00:16

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 2.081986 0.208487 0.902462 0.313160 1.312273 0.493157 0.926629 0.786345
Stream 1 2.755427 0.911578 3.618085 0.664407 3.740112 2.118189 4.738754 6.551446
Stream 2 4.189612 0.957921 5.267355 2.152479 6.068005 1.263380 4.251842 3.620160
Stream 3 4.708834 0.981651 2.411839 0.790955 4.384516 1.322670 2.641571 4.771831
Stream 4 3.739567 1.185884 2.863871 1.517891 5.946967 1.179960 3.840560 4.926325
Stream 5 5.258746 0.705228 3.460904 0.951328 4.530620 1.104500 3.226494 4.041142
Min Qi 2.755427 0.705228 2.411839 0.664407 3.740112 1.104500 2.641571 3.620160
Max Qi 5.258746 1.185884 5.267355 2.152479 6.068005 2.118189 4.738754 6.551446
Avg Qi 4.130437 0.948452 3.524411 1.215412 4.934044 1.397740 3.739844 4.782181
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 3.226685 1.878227 1.802562 0.676499 3.145884 0.653129 0.963449 0.990524
Stream 1 8.842030 5.630466 5.728147 2.643227 9.615551 3.197855 4.676538 4.285251
Stream 2 9.508612 5.288044 4.319998 1.492915 9.431995 3.206360 3.859749 3.201996
Stream 3 10.480224 5.880274 4.517320 2.509405 6.913159 2.892479 6.408602 2.938061
Stream 4 8.824111 5.752413 5.997959 2.581237 8.954756 3.351951 2.420598 4.148455
Stream 5 4.905553 7.099111 5.121041 2.516020 9.354924 3.955638 4.389209 3.818902
Min Qi 4.905553 5.288044 4.319998 1.492915 6.913159 2.892479 2.420598 2.938061
Max Qi 10.480224 7.099111 5.997959 2.643227 9.615551 3.955638 6.408602 4.285251
Avg Qi 8.512106 5.930062 5.136893 2.348561 8.854077 3.320857 4.350939 3.678533
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.405338 0.868313 0.806277 1.123366 1.314028 0.233214 2.590459 1.230242
Stream 1 5.191045 3.171244 3.403836 4.604523 3.721133 0.892096 7.136841 6.500452
Stream 2 6.282687 2.845465 3.024786 4.086546 3.530743 0.619683 9.263671 4.826173
Stream 3 6.040787 2.659766 2.787273 6.210077 3.902190 2.175417 7.974860 6.689780
Stream 4 4.978721 2.542674 3.518783 4.385571 3.906211 0.918752 6.303352 5.139326
Stream 5 5.208600 3.761975 3.682886 7.874493 5.017600 2.087150 7.999074 7.978154
Min Qi 4.978721 2.542674 2.787273 4.086546 3.530743 0.619683 6.303352 4.826173
Max Qi 6.282687 3.761975 3.682886 7.874493 5.017600 2.175417 9.263671 7.978154
Avg Qi 5.540368 2.996225 3.283513 5.432242 4.015575 1.338620 7.735560 6.226777

1000 GB Run 1

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 1000
Total Data Storage/Database Size 26M
Query Streams for Throughput Test 7
Virt-H Power 136,744.5
Virt-H Throughput 147,374.6
Virt-H Composite Query-per-Hour Metric (Qph@1000GB) 141,960.1
Measurement Interval in Throughput Test (Ts) 3,761.953000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 09:18:42 10/03/2014 09:34:12 0:15:30
Stream 1 10/03/2014 09:34:43 10/03/2014 10:35:42 1:00:59
Stream 2 10/03/2014 09:34:43 10/03/2014 10:37:14 1:02:31
Stream 3 10/03/2014 09:34:43 10/03/2014 10:37:25 1:02:42
Stream 4 10/03/2014 09:34:43 10/03/2014 10:33:31 0:58:48
Stream 5 10/03/2014 09:34:43 10/03/2014 10:35:26 1:00:43
Stream 6 10/03/2014 09:34:43 10/03/2014 10:28:00 0:53:17
Stream 7 10/03/2014 09:34:43 10/03/2014 10:35:42 1:00:59
Refresh 0 10/03/2014 09:18:42 10/03/2014 09:19:27 0:00:45
10/03/2014 09:34:12 10/03/2014 09:34:42 0:00:30
Refresh 1 10/03/2014 09:43:03 10/03/2014 09:43:38 0:00:35
Refresh 2 10/03/2014 09:34:43 10/03/2014 09:36:54 0:02:11
Refresh 3 10/03/2014 09:36:53 10/03/2014 09:38:39 0:01:46
Refresh 4 10/03/2014 09:38:39 10/03/2014 09:39:22 0:00:43
Refresh 5 10/03/2014 09:39:23 10/03/2014 09:41:09 0:01:46
Refresh 6 10/03/2014 09:41:09 10/03/2014 09:42:15 0:01:06
Refresh 7 10/03/2014 09:42:15 10/03/2014 09:43:02 0:00:47

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 104.488583 18.351559 24.631282 36.195531 36.319915 3.807790 22.750889 31.190630
Stream 1 209.323441 26.205435 59.637373 245.808484 60.699333 22.369379 289.435780 335.733425
Stream 2 109.134446 64.185831 96.131735 108.459418 310.273986 53.595127 152.242755 104.350098
Stream 3 73.321611 215.535408 69.543101 12.423757 64.445611 38.254747 122.952872 98.713213
Stream 4 110.875875 4.272757 78.697314 16.316807 59.746855 23.447211 353.190412 342.549908
Stream 5 41.972337 5.978707 60.784575 34.219229 42.372449 344.590640 146.186614 274.972270
Stream 6 115.760155 18.692078 58.493147 9.193234 49.831932 19.081395 60.603109 128.095501
Stream 7 58.601744 118.126585 297.327543 298.578268 714.284222 108.475250 91.868151 55.881029
Min Qi 41.972337 4.272757 58.493147 9.193234 42.372449 19.081395 60.603109 55.881029
Max Qi 209.323441 215.535408 297.327543 298.578268 714.284222 344.590640 353.190412 342.549908
Avg Qi 102.712801 64.713829 102.944970 103.571314 185.950627 87.116250 173.782813 191.470778
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 41.777880 10.035063 16.125611 9.245638 209.443782 111.271310 37.821595 9.483838
Stream 1 244.243830 63.473338 207.741931 33.696956 561.057408 141.026049 126.818051 54.774792
Stream 2 189.297446 144.853756 56.292537 184.781273 501.330052 49.965102 107.736393 85.691079
Stream 3 231.060699 355.394713 43.483645 11.806590 555.445111 36.722686 251.241817 9.057850
Stream 4 227.371508 32.207115 108.880658 139.922550 532.697956 57.106583 159.198489 153.088913
Stream 5 416.113856 108.689389 62.847727 702.712683 622.906487 58.198961 89.707091 85.614769
Stream 6 228.019243 62.474213 88.227994 282.932978 432.387869 238.544027 61.486269 56.950548
Stream 7 230.564416 69.197517 130.708759 120.531103 551.112816 57.438478 82.256530 63.796403
Min Qi 189.297446 32.207115 43.483645 11.806590 432.387869 36.722686 61.486269 9.057850
Max Qi 416.113856 355.394713 207.741931 702.712683 622.906487 238.544027 251.241817 153.088913
Avg Qi 252.381571 119.470006 99.740464 210.912019 536.705386 91.285984 125.492091 72.710622
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 22.897349 47.870269 12.735580 25.982194 46.091766 6.623306 45.120559 30.016788
Stream 1 123.444839 22.212194 647.523826 97.431531 81.592165 4.573040 21.068225 14.486185
Stream 2 80.853865 622.651044 288.656211 336.409076 70.925079 33.578052 82.910543 48.001583
Stream 3 392.340812 84.967695 57.181935 473.720060 497.262620 66.966740 54.778284 50.940094
Stream 4 97.069440 301.705125 338.035788 258.992426 103.699408 28.750257 23.858757 13.626079
Stream 5 69.882110 34.277914 146.031938 179.656129 104.788154 10.836148 54.319823 52.077352
Stream 6 141.310431 247.242904 94.392791 702.775460 80.142930 19.969889 46.027410 19.136271
Stream 7 89.018281 51.105998 281.234432 79.046122 84.341517 26.221892 33.169666 13.309634
Min Qi 69.882110 22.212194 57.181935 79.046122 70.925079 4.573040 21.068225 13.309634
Max Qi 392.340812 622.651044 647.523826 702.775460 497.262620 66.966740 82.910543 52.077352
Avg Qi 141.988540 194.880411 264.722417 304.004401 146.107410 27.270860 45.161815 30.225314

1000 GB Run 2

Virt-H Executive Summary

Report Date October 3, 2014
Database Scale Factor 1000
Total Data Storage/Database Size 26M
Query Streams for Throughput Test 7
Virt-H Power 199,652.0
Virt-H Throughput 125,161.1
Virt-H Composite Query-per-Hour Metric (Qph@1000GB) 158,078.0
Measurement Interval in Throughput Test (Ts) 4,429.608000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 10/03/2014 10:37:29 10/03/2014 10:52:26 0:14:57
Stream 1 10/03/2014 10:52:35 10/03/2014 12:05:19 1:12:44
Stream 2 10/03/2014 10:52:35 10/03/2014 12:06:25 1:13:50
Stream 3 10/03/2014 10:52:35 10/03/2014 12:03:08 1:10:33
Stream 4 10/03/2014 10:52:35 10/03/2014 12:05:20 1:12:45
Stream 5 10/03/2014 10:52:35 10/03/2014 11:57:40 1:05:05
Stream 6 10/03/2014 10:52:35 10/03/2014 12:05:28 1:12:53
Stream 7 10/03/2014 10:52:35 10/03/2014 12:05:25 1:12:50
Refresh 0 10/03/2014 10:37:29 10/03/2014 10:37:52 0:00:23
10/03/2014 10:52:25 10/03/2014 10:52:34 0:00:09
Refresh 1 10/03/2014 11:01:44 10/03/2014 11:02:29 0:00:45
Refresh 2 10/03/2014 10:52:35 10/03/2014 10:54:50 0:02:15
Refresh 3 10/03/2014 10:54:50 10/03/2014 10:57:02 0:02:12
Refresh 4 10/03/2014 10:57:05 10/03/2014 10:58:47 0:01:42
Refresh 5 10/03/2014 10:58:47 10/03/2014 10:59:46 0:00:59
Refresh 6 10/03/2014 10:59:45 10/03/2014 11:00:38 0:00:53
Refresh 7 10/03/2014 11:00:39 10/03/2014 11:01:44 0:01:05

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 34.105419 1.439089 9.802183 2.033956 10.525742 3.356152 23.953729 36.199533
Stream 1 26.598252 150.572833 41.930330 86.870320 50.604856 201.001372 61.638366 244.013359
Stream 2 50.129895 102.219282 12.380935 102.319615 62.577229 43.454392 891.076608 407.640626
Stream 3 269.947278 53.172724 54.649973 11.460062 66.695722 17.336698 63.371232 91.158050
Stream 4 41.149221 22.520836 28.707973 509.984321 68.916549 17.525025 702.191490 666.450230
Stream 5 59.179045 30.734442 99.504351 11.145990 101.334340 21.660836 74.625589 535.160207
Stream 6 225.105215 55.567328 46.749707 554.474507 215.657091 54.362551 72.960653 442.194302
Stream 7 220.993226 28.528230 47.543365 336.191006 308.931194 9.767397 850.258452 66.121298
Min Qi 26.598252 22.520836 12.380935 11.145990 50.604856 9.767397 61.638366 66.121298
Max Qi 269.947278 150.572833 99.504351 554.474507 308.931194 201.001372 891.076608 666.450230
Avg Qi 127.586019 63.330811 47.352376 230.349403 124.959569 52.158324 388.017484 350.391153
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 50.439615 9.287196 15.892947 7.112715 250.527755 131.478131 54.458992 10.525842
Stream 1 420.919329 317.402771 101.818338 403.213385 724.539887 160.669174 65.374584 28.563034
Stream 2 464.378760 210.938167 23.395678 545.086468 736.005716 54.680686 398.880053 34.018918
Stream 3 350.083270 321.781561 48.652019 435.954962 378.872739 100.588804 289.350342 190.140640
Stream 4 306.265994 249.621982 79.280220 221.255121 348.932746 49.555802 100.062439 61.368814
Stream 5 511.923087 133.018420 134.199065 9.655693 662.658830 104.380635 82.847242 59.952271
Stream 6 578.362701 61.221715 145.613349 47.957006 621.993889 256.150595 77.124777 91.163005
Stream 7 418.450091 391.818564 29.360218 17.236628 761.850888 31.952329 50.393082 27.530882
Min Qi 306.265994 61.221715 23.395678 9.655693 348.932746 31.952329 50.393082 27.530882
Max Qi 578.362701 391.818564 145.613349 545.086468 761.850888 256.150595 398.880053 190.140640
Avg Qi 435.769033 240.829026 80.331270 240.051323 604.979242 108.282575 152.004646 70.391081
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 22.444111 37.978532 13.347320 26.553364 115.511143 7.670304 22.771613 8.761026
Stream 1 329.153807 19.198590 258.455295 556.256015 99.647793 14.878746 32.803289 8.771923
Stream 2 76.940373 74.916489 75.246897 16.035355 14.403643 32.348500 91.981362 41.426540
Stream 3 88.918404 238.858707 221.257060 688.441713 247.669761 5.345632 70.780594 49.352955
Stream 4 497.105081 167.874781 67.668514 76.820831 78.585717 3.655421 73.165786 29.401670
Stream 5 309.991618 123.023557 380.801141 347.055909 93.478502 18.351491 33.338814 12.557542
Stream 6 57.200926 154.489850 386.007137 103.558355 32.676369 92.863316 35.576966 14.061801
Stream 7 160.332088 46.934177 340.957970 84.479720 78.985110 60.568796 44.362737 8.831746
Min Qi 57.200926 19.198590 67.668514 16.035355 14.403643 3.655421 32.803289 8.771923
Max Qi 497.105081 238.858707 386.007137 688.441713 247.669761 92.863316 91.981362 49.352955
Avg Qi 217.091757 117.899450 247.199145 267.521128 92.206699 32.573129 54.572793 23.486311

To be continued...

In Hoc Signo Vinces (TPC-H) Series

10/06/2014 13:53 GMT
In Hoc Signo Vinces (part 19 of n): Scalability, 1000G, and 3000G [ Virtuso Data Space Bot ]

Scalability, specifically linear scalability, means that twice the data takes twice as long to process, or that double the gear processes the same data in half the time. This is only literally true for "embarrassingly parallel" workloads.

There are parts of TPC-H which have an embarrassingly parallel nature, like Q1 and Q7. There are parts that are almost as easy, like Q14, Q17, Q19, and Q21, where there is a big scan and a selective hash join with a hash table small enough to replicate everywhere. The scan scales linearly; building the hash does not, since it is done at single-server speed (once in each process). Some queries like Q9 and Q13 end up doing a big cross-partition join which runs into communication overheads.

This is our first look at how performance behaves with bigger data and a larger platform. The results shown here are interesting but are not final. I bet I can do better; by how much is what we'll find out soon enough.

Here we compare a 1000G setup on my desktop with a 3000G setup at the CWI's Scilens cluster. The former is 2 boxes of dual Xeon E5 2630, and the latter is 8 boxes of dual Xeon E5 2650v2. Everything runs from memory, and both have a QDR IB interconnect. Counting cores and clock, the CWI cluster is 6x larger.

As a rough approximation, for the worst queries, 6x the gear runs 3x the data in the same amount of real time. The 1000G setup has near full platform utilization and the 3000G setup has about half platform utilization. In both cases, running two instances of the same query at the same time takes twice as long.
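As a rough check of what this approximation implies, scale-up efficiency can be read as data growth divided by hardware growth at equal run time. The sketch below is only an illustration of that arithmetic; the function name and its parameters are hypothetical and not part of any benchmark kit.

```python
def scaleup_efficiency(data_factor: float, gear_factor: float,
                       time_factor: float = 1.0) -> float:
    """Work done per unit of hardware relative to the baseline.

    1.0 would be linear scaling; lower values mean the larger
    platform is not fully used.
    """
    return data_factor / (gear_factor * time_factor)


# 6x the gear runs 3x the data in the same amount of real time.
print(scaleup_efficiency(data_factor=3, gear_factor=6))
```

A value of 0.5 is in line with the observation that the 3000G setup runs at about half platform utilization.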

We use Q9 for this study. The plan makes a hash table of part with 1/14 of all parts, replicating to all processes. Then there is a hash table of partsupp with a key of ps_partkey, ps_suppkey, and a dependent of ps_supplycost. This is much larger than the part hash table and is therefore partitioned on ps_partkey. The build is for 1/14th of partsupp. Then there is a scan of lineitem filtered by the part hash table; then a cross-partition join to the partsupp hash table; then a cross partition join to orders, this time by index; then a hash join on a replicated hash table of supplier; then nation; then aggregation. The aggregation is done in each slice; then the slices are added up at the end.

The plan could be made better by one fewer partition crossing. Now there is a crossing from l_orderkey to l_partkey and back to o_orderkey. This would not be so if the cost model knew that the probe into the partsupp hash table always hits. The cost model thinks it hits 1/14 of the time, because it does not know that the selection on the build is exactly the same as on the probe.
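The cardinality point can be illustrated with a toy model. This is a sketch under stated assumptions, not Virtuoso's cost model: the modulo rule stands in for the real predicate on part, and the table sizes are made up. It only shows that when the build side and the probe side carry the same selection, every probe hits, whereas an independence assumption would predict 1/14.

```python
import random

SELECTIVITY = 14  # keep 1 part in 14, as in the plan described above


def selected(partkey: int) -> bool:
    # Hypothetical stand-in for the Q9 predicate on part.
    return partkey % SELECTIVITY == 0


def probe_hit_rate(n_parts: int = 1400, suppliers_per_part: int = 4,
                   n_lineitems: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    # Build side: partsupp keys whose part passes the selection.
    partsupp_build = {(p, s)
                      for p in range(n_parts) if selected(p)
                      for s in range(suppliers_per_part)}
    # Probe side: lineitem rows already filtered by the same selection on part.
    probes = []
    for _ in range(n_lineitems):
        p = rng.randrange(n_parts)
        s = rng.randrange(suppliers_per_part)
        if selected(p):
            probes.append((p, s))
    hits = sum(1 for key in probes if key in partsupp_build)
    return hits / len(probes)


print("independence estimate:", 1 / SELECTIVITY)
print("observed hit rate:", probe_hit_rate())
```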

For the present purposes, the extra crossing just serves to make the matter of interest more visible.

So, for the 1000G setup, we have 43.6 seconds (s) and

Cluster 4 nodes, 44 s. 459 m/s 119788 KB/s  3120% cpu 0%  read 19% clw threads 1r 0w 0i buffers 17622126 68 d 0 w 0 pfs
 

For the 3000G setup, we have 49.9 s and

Cluster 16 nodes, 50 s. 49389 m/s 1801815 KB/s  7283% cpu 0%  read 18% clw threads 1r 0w 0i buffers 135122893 15895255 d 0 w 17 pfs
 

The platform utilization on the small system is better, at 31/48 (running/total threads); the large one has 73/256.
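These ratios can be read off the status lines above. The sketch below assumes that the "NNNN% cpu" field counts 100% per busy thread and that the line format is as shown here; the parser is a hypothetical helper, not a Virtuoso API. The thread totals of 48 and 256 are taken from the text.

```python
import re


def running_threads(status_line: str) -> float:
    """Average number of busy threads implied by the 'NNNN% cpu' field."""
    match = re.search(r"(\d+)% cpu", status_line)
    if match is None:
        raise ValueError("no '% cpu' field found")
    return int(match.group(1)) / 100.0


small = ("Cluster 4 nodes, 44 s. 459 m/s 119788 KB/s  3120% cpu 0%  read "
         "19% clw threads 1r 0w 0i buffers 17622126 68 d 0 w 0 pfs")
large = ("Cluster 16 nodes, 50 s. 49389 m/s 1801815 KB/s  7283% cpu 0%  read "
         "18% clw threads 1r 0w 0i buffers 135122893 15895255 d 0 w 17 pfs")

for name, line, total in (("small", small, 48), ("large", large, 256)):
    busy = running_threads(line)
    print(f"{name}: {busy:.1f}/{total} threads = {busy / total:.0%}")
```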

The large case is clearly network bound. If it were bound by CPU only, it should be done in half the time it takes the small system to do 1000G.

We confirm this by looking at write wait: 3940 s of thread time blocked on write over 50 s of real time. On the small system, the figures are 3.9 s of thread time blocked over 39 s of real time. The data transfer on the large system is 93 GB.

How to block less? One idea is to write less. So we try compression; Virtuoso has a message compression option based on Google's Snappy.

We now get 39.6 s and

Cluster 16 nodes, 40 s. 65161 m/s 1239922 KB/s  10201% cpu 0%  read 21% clw threads 1r 0w 0i buffers 52828440 172 d 0 w 0 pfs
 

The write block time is 397 s of thread time over 39 s of real time, 10x better. The data transfer is 50.9 GB after compression. Snappy is somewhat effective for compression and very fast; in CPU profile, it is under 3% of Q9 on the small system. Gains on the small system are less, though, since blocking is not a big issue to start with.
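To put the write-wait figures in perspective, blocked thread time divided by real time gives the average number of threads stuck in a write at any moment. The helper below is a hypothetical illustration that uses only the numbers quoted in the text.

```python
def avg_blocked_threads(blocked_thread_seconds: float,
                        real_seconds: float) -> float:
    """Average number of threads sitting in a blocked write at any moment."""
    return blocked_thread_seconds / real_seconds


uncompressed = avg_blocked_threads(3940, 50)
compressed = avg_blocked_threads(397, 39)

print(f"blocked threads, uncompressed: {uncompressed:.1f}")
print(f"blocked threads, compressed:   {compressed:.1f}")
print(f"blocked time reduction: {3940 / 397:.1f}x")
print(f"transfer reduction: {93 / 50.9:.2f}x")
```

So the bytes on the wire shrink by a bit under 2x, while the time spent blocked drops by about 10x.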

This is still not full platform utilization. But if the data transfer is further cut in half by a better plan, the situation will be quite good. Now we have 102/256 threads running, meaning that there could be another 40-50% of throughput to be added. The last 128 threads are second threads of a core, so each counts for roughly 30% of a real core.

The main cluster-specific operation is a send from one to many. This is now done by formulating the message to each recipient in a chain of string buffers; then, after all the messages are prepared, these are optionally compressed and sent to their recipients. This is needlessly simple: compression could proceed whenever a write would block. Once all the compression is done, a blocked write should switch to another recipient, and only when all recipients are in a would-block situation should the thread call select with all descriptors and block on them collectively. There is a piece of code to this effect, but it is not currently in use. It has been seen to add no value in small cases, but could be useful here.
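The switching scheme can be sketched as follows. This is a minimal illustration using Python's standard socket and select modules, not Virtuoso's actual code; the function name and the dictionary interface are assumptions made for the example. Compression is left out: Snappy is not in the Python standard library, and interleaving compression with blocked writes would add bookkeeping that obscures the main idea.

```python
import select
import socket


def send_to_many(outgoing: dict) -> None:
    """Send each payload to its socket without blocking on any single recipient.

    `outgoing` maps a non-blocking socket to the bytes destined for it.
    """
    pending = {sock: memoryview(data) for sock, data in outgoing.items() if data}
    while pending:
        progressed = False
        for sock in list(pending):
            buf = pending[sock]
            try:
                sent = sock.send(buf)
            except BlockingIOError:
                continue  # would block: switch to the next recipient
            progressed = True
            if sent == len(buf):
                del pending[sock]
            else:
                pending[sock] = buf[sent:]
        if pending and not progressed:
            # Every remaining recipient would block: wait on all of them at once.
            select.select([], list(pending), [])
```

Each socket is expected to have been put in non-blocking mode with `setblocking(False)` before the call. A slow recipient then delays only its own message, and the thread sleeps only when no recipient can accept data.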

The IB fabric has been seen to do 1.8 GB/s bidirectionally on multiple independent point-to-point TCP links. This is about half the nominal 4 GB/s (40 Gbit/s with 8b/10b encoding). So the aggregate throughputs that we see here are nowhere near the nominal spec of the network. Lower level interfaces and the occasional busy wait on the reading end could be tried to some advantage. We have not tried 10GbE either; but if that works at nominal speed, then 10GbE should also be good enough. We will try this at Amazon in due time.
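The nominal figure follows from the link arithmetic: QDR InfiniBand signals at 40 Gbit/s and uses 8b/10b line coding, so 8 of every 10 bits on the wire are data. A small sketch of the calculation, with hypothetical names:

```python
SIGNAL_GBIT_S = 40            # QDR InfiniBand 4x signalling rate
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b: 8 data bits per 10 line bits


def nominal_gbytes_per_s(signal_gbit_s: float, efficiency: float) -> float:
    """Usable data rate in GB/s for a given signalling rate and line coding."""
    return signal_gbit_s * efficiency / 8  # bits to bytes


nominal = nominal_gbytes_per_s(SIGNAL_GBIT_S, ENCODING_EFFICIENCY)
print(f"nominal: {nominal:.1f} GB/s, measured share: {1.8 / nominal:.0%}")
```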

In the meantime, there is a 3000G test made at the CWI cluster without message compression. The score is about 4x that of the single server at 300G using the same hardware. The run is with approximately half platform utilization. There are three runs of power plus throughput, the first run being cold.

Run Power Throughput Composite
Run 1 305,881.5 1,072,411.9 572,739.8
Run 2 1,292,085.1 1,179,391.6 1,234,453.1
Run 3 1,178,534.1 1,092,936.2 1,134,928.4
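The composite column is consistent with the geometric mean of power and throughput, which is how TPC-H defines its composite metric. The sketch below recomputes it from the table; the function name is chosen here for illustration.

```python
import math


def composite(power: float, throughput: float) -> float:
    """Geometric mean of the power and throughput metrics."""
    return math.sqrt(power * throughput)


# (power, throughput, reported composite) for the three 3000G runs
runs = [
    (305_881.5, 1_072_411.9, 572_739.8),
    (1_292_085.1, 1_179_391.6, 1_234_453.1),
    (1_178_534.1, 1_092_936.2, 1_134_928.4),
]
for power, throughput, reported in runs:
    print(f"computed {composite(power, throughput):,.1f}  reported {reported:,.1f}")
```

The geometric mean explains why the cold first run, with its low power score, pulls the composite down so sharply even though its throughput is close to that of the other runs.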

The numerical quantities summaries follow. One problem with the run is a high peak of query memory consumption, which leads to slowdown. Some parts should probably be done in multiple passes to keep the peak lower and not run into swapping. The details will have to be sorted out. This is a demonstration of capability; the perfected accomplishment is to follow.

3000G Run 1

Virt-H Executive Summary

Report Date September 29, 2014
Database Scale Factor 3000
Query Streams for Throughput Test 8
Virt-H Power 305,881.5
Virt-H Throughput 1,072,411.9
Virt-H Composite Query-per-Hour Metric (Qph@3000GB) 572,739.8
Measurement Interval in Throughput Test (Ts) 1,772.554000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 09/29/2014 12:54:52 09/29/2014 13:31:17 0:36:25
Stream 1 09/29/2014 13:31:24 09/29/2014 13:59:24 0:28:00
Stream 2 09/29/2014 13:31:24 09/29/2014 13:58:59 0:27:35
Stream 3 09/29/2014 13:31:24 09/29/2014 13:58:29 0:27:05
Stream 4 09/29/2014 13:31:24 09/29/2014 13:58:52 0:27:28
Stream 5 09/29/2014 13:31:24 09/29/2014 14:00:06 0:28:42
Stream 6 09/29/2014 13:31:24 09/29/2014 13:58:18 0:26:54
Stream 7 09/29/2014 13:31:24 09/29/2014 13:59:25 0:28:01
Stream 8 09/29/2014 13:31:24 09/29/2014 13:58:50 0:27:26
Refresh 0 09/29/2014 12:54:52 09/29/2014 12:56:59 0:02:07
09/29/2014 13:31:17 09/29/2014 13:31:23 0:00:06
Refresh 1 09/29/2014 14:00:38 09/29/2014 14:01:11 0:00:33
Refresh 2 09/29/2014 13:31:25 09/29/2014 13:36:57 0:05:32
Refresh 3 09/29/2014 13:36:56 09/29/2014 13:47:02 0:10:06
Refresh 4 09/29/2014 13:47:03 09/29/2014 13:51:40 0:04:37
Refresh 5 09/29/2014 13:51:42 09/29/2014 13:56:40 0:04:58
Refresh 6 09/29/2014 13:56:40 09/29/2014 13:59:25 0:02:45
Refresh 7 09/29/2014 13:59:25 09/29/2014 14:00:10 0:00:45
Refresh 8 09/29/2014 14:00:11 09/29/2014 14:00:37 0:00:26

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 601.576975 90.803782 108.725110 177.112667 171.995572 2.098138 15.768311 152.511444
Stream 1 13.310341 32.722946 125.551415 1.912836 46.041675 13.294214 85.345068 165.424288
Stream 2 19.425885 9.248670 150.855556 7.085737 88.445566 10.490432 49.318554 322.500839
Stream 3 30.534391 14.273478 100.987791 59.341763 46.442443 9.613795 64.186196 146.324186
Stream 4 28.211213 37.134522 64.189335 10.931513 100.610673 9.929866 112.270530 108.489951
Stream 5 29.226411 18.132589 95.245160 63.100068 115.663908 6.151231 46.251309 127.742471
Stream 6 30.750930 20.888658 108.894177 55.168565 82.016828 69.451493 65.161517 103.697733
Stream 7 13.462570 18.033847 32.065492 78.910373 202.998301 10.688279 47.167022 139.601948
Stream 8 24.354314 16.711503 112.008551 8.307098 126.849630 7.127605 51.083118 98.648077
Min Qi 13.310341 9.248670 32.065492 1.912836 46.041675 6.151231 46.251309 98.648077
Max Qi 30.750930 37.134522 150.855556 78.910373 202.998301 69.451493 112.270530 322.500839
Avg Qi 23.659507 20.893277 98.724685 35.594744 101.133628 17.093364 65.097914 151.553687
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 92.991259 5.175922 42.238393 29.239879 367.805534 3.604910 15.557396 11.650267
Stream 1 149.502128 30.197806 50.786184 217.190836 283.545905 11.653171 73.321150 116.860455
Stream 2 245.783668 22.278841 50.578731 36.301810 181.405269 32.236754 57.631764 61.540533
Stream 3 377.782738 24.129319 84.097657 10.959661 171.698669 8.973519 54.532180 45.527142
Stream 4 341.148908 74.358770 85.782399 43.116347 151.146233 22.870727 74.439693 51.871535
Stream 5 72.259919 11.424035 79.310504 9.833135 562.871920 14.961209 127.861874 55.377721
Stream 6 373.301225 41.379753 81.983260 9.373200 95.039317 19.071346 76.159452 48.324504
Stream 7 449.871952 16.099152 48.047940 8.559784 211.094730 10.569071 26.710228 72.571454
Stream 8 395.771006 33.537585 54.850876 141.526389 153.763316 12.997092 127.961975 57.100346
Min Qi 72.259919 11.424035 48.047940 8.559784 95.039317 8.973519 26.710228 45.527142
Max Qi 449.871952 74.358770 85.782399 217.190836 562.871920 32.236754 127.961975 116.860455
Avg Qi 300.677693 31.675658 66.929694 59.607645 226.320670 16.666611 77.327289 63.646711
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 12.230334 70.991261 33.092797 17.517230 15.798438 19.743562 127.494687 5.893471
Stream 1 27.550293 14.970857 16.442806 111.138612 68.214095 7.884782 27.109441 6.087067
Stream 2 43.277918 12.748690 22.681844 92.835566 84.416610 14.661934 151.094498 153.285076
Stream 3 129.696125 13.435663 14.674499 129.179966 39.176513 6.286296 181.596838 416.052710
Stream 4 110.348816 7.080225 21.051910 85.758973 65.130356 7.292999 123.386514 151.000786
Stream 5 43.365006 9.847612 32.881770 94.752284 67.788314 9.035439 72.539334 223.967821
Stream 6 34.534280 36.347298 27.849276 122.736244 51.447492 25.051058 80.452175 84.519426
Stream 7 48.021860 30.594474 22.522426 99.245893 73.076698 7.260729 38.585852 5.697277
Stream 8 29.484201 12.368769 40.344043 84.137820 30.813313 4.856991 22.196547 4.600057
Min Qi 27.550293 7.080225 14.674499 84.137820 30.813313 4.856991 22.196547 4.600057
Max Qi 129.696125 36.347298 40.344043 129.179966 84.416610 25.051058 181.596838 416.052710
Avg Qi 58.284812 17.174198 24.806072 102.473170 60.007924 10.291279 87.120150 130.651277

3000G Run 2

Virt-H Executive Summary

Report Date September 29, 2014
Database Scale Factor 3000
Query Streams for Throughput Test 8
Virt-H Power 1,292,085.1
Virt-H Throughput 1,179,391.6
Virt-H Composite Query-per-Hour Metric (Qph@3000GB) 1,234,453.1
Measurement Interval in Throughput Test (Ts) 1,611.779000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 09/29/2014 14:01:15 09/29/2014 14:06:48 0:05:33
Stream 1 09/29/2014 14:06:53 09/29/2014 14:30:22 0:23:29
Stream 2 09/29/2014 14:06:53 09/29/2014 14:32:30 0:25:37
Stream 3 09/29/2014 14:06:53 09/29/2014 14:31:23 0:24:30
Stream 4 09/29/2014 14:06:53 09/29/2014 14:31:34 0:24:41
Stream 5 09/29/2014 14:06:53 09/29/2014 14:32:53 0:26:00
Stream 6 09/29/2014 14:06:53 09/29/2014 14:29:51 0:22:58
Stream 7 09/29/2014 14:06:53 09/29/2014 14:31:34 0:24:41
Stream 8 09/29/2014 14:06:53 09/29/2014 14:30:35 0:23:42
Refresh 0 09/29/2014 14:01:15 09/29/2014 14:01:35 0:00:20
09/29/2014 14:06:49 09/29/2014 14:06:53 0:00:04
Refresh 1 09/29/2014 14:33:16 09/29/2014 14:33:45 0:00:29
Refresh 2 09/29/2014 14:06:55 09/29/2014 14:12:28 0:05:33
Refresh 3 09/29/2014 14:12:29 09/29/2014 14:21:55 0:09:26
Refresh 4 09/29/2014 14:21:55 09/29/2014 14:27:40 0:05:45
Refresh 5 09/29/2014 14:27:43 09/29/2014 14:31:14 0:03:31
Refresh 6 09/29/2014 14:31:14 09/29/2014 14:31:51 0:00:37
Refresh 7 09/29/2014 14:31:51 09/29/2014 14:32:52 0:01:01
Refresh 8 09/29/2014 14:32:52 09/29/2014 14:33:16 0:00:24

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 9.451169 3.644118 18.419151 1.404395 15.740525 2.085038 15.171847 25.400834
Stream 1 19.558041 6.607300 85.774410 4.503525 81.448472 11.976129 92.140470 145.743853
Stream 2 31.042019 7.877299 71.958033 8.862111 142.452144 18.489193 81.003310 85.856529
Stream 3 38.833612 12.440326 86.063103 7.165120 84.707025 16.931531 100.442710 122.411252
Stream 4 15.751913 33.026762 50.457193 7.064220 114.130257 5.992556 66.035959 84.596973
Stream 5 18.462884 28.047942 110.690543 16.566547 104.403789 5.303453 72.552640 402.383383
Stream 6 17.858339 33.988800 110.431091 7.238431 72.229953 16.850955 68.231546 180.601000
Stream 7 23.055572 17.044813 96.105520 8.941132 171.130879 8.423100 70.634541 147.261648
Stream 8 19.840798 13.860740 74.961175 16.171566 56.165875 5.904921 47.646217 125.991819
Min Qi 15.751913 6.607300 50.457193 4.503525 56.165875 5.303453 47.646217 84.596973
Max Qi 38.833612 33.988800 110.690543 16.566547 171.130879 18.489193 100.442710 402.383383
Avg Qi 23.050397 19.111748 85.805134 9.564082 103.333549 11.233980 74.835924 161.855807
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 54.766945 5.551163 29.216632 3.035008 52.816902 3.346243 15.767022 10.066112
Stream 1 130.666380 9.658277 49.332720 103.036705 194.520370 12.166344 65.144599 97.158571
Stream 2 254.754936 22.605298 38.102466 21.121168 300.467330 12.262318 108.203491 50.696657
Stream 3 283.761567 19.327164 73.414574 7.431651 183.121904 12.573854 73.814766 46.802493
Stream 4 290.341947 57.452026 58.354221 13.066162 189.263163 18.998781 121.269774 54.831406
Stream 5 81.787025 8.410538 79.822552 16.005077 190.730342 21.697136 100.456487 46.744884
Stream 6 202.558515 39.360009 74.519981 15.960756 137.321631 26.583824 57.537668 60.758997
Stream 7 226.790801 44.175536 73.992368 7.561897 182.853851 17.597471 31.128055 44.389893
Stream 8 275.423934 21.980040 60.538239 39.736622 173.574795 58.786316 95.124912 25.564108
Min Qi 81.787025 8.410538 38.102466 7.431651 137.321631 12.166344 31.128055 25.564108
Max Qi 290.341947 57.452026 79.822552 103.036705 300.467330 58.786316 121.269774 97.158571
Avg Qi 218.260638 27.871111 63.509640 27.990005 193.981673 22.583255 81.584969 53.368376
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 13.620157 2.288504 4.166807 16.468447 9.991810 1.101775 20.152227 4.294680
Stream 1 44.026143 31.720525 25.684461 134.254716 30.797008 9.568594 24.328205 4.319533
Stream 2 40.283148 9.970277 29.731019 133.083785 29.322194 8.859556 73.251098 249.850045
Stream 3 44.288244 18.914661 38.162762 144.458624 22.556235 6.184842 117.267234 445.700238
Stream 4 67.147744 6.649451 27.876825 59.226248 69.373248 44.478703 61.381724 282.608075
Stream 5 36.403227 12.226129 21.997683 95.912670 44.219799 21.117974 106.473817 97.896971
Stream 6 42.114038 30.805969 25.929027 51.658733 26.475662 34.816500 31.309953 5.608395
Stream 7 48.601889 18.708127 18.893532 132.558026 50.476383 12.309402 22.661371 37.610815
Stream 8 34.413417 34.709883 37.058335 121.710608 44.676485 9.449332 19.311945 4.420232
Min Qi 34.413417 6.649451 18.893532 51.658733 22.556235 6.184842 19.311945 4.319533
Max Qi 67.147744 34.709883 38.162762 144.458624 69.373248 44.478703 117.267234 445.700238
Avg Qi 44.659731 20.463128 28.166705 109.107926 39.737127 18.348113 56.998168 141.001788

3000G Run 3

Virt-H Executive Summary

Report Date September 29, 2014
Database Scale Factor 3000
Query Streams for Throughput Test 8
Virt-H Power 1,178,534.1
Virt-H Throughput 1,092,936.2
Virt-H Composite Query-per-Hour Metric (Qph@3000GB) 1,134,928.4
Measurement Interval in Throughput Test (Ts) 1,739.269000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 09/29/2014 14:33:48 09/29/2014 14:40:59 0:07:11
Stream 1 09/29/2014 14:41:04 09/29/2014 15:10:02 0:28:58
Stream 2 09/29/2014 14:41:04 09/29/2014 15:09:07 0:28:03
Stream 3 09/29/2014 14:41:04 09/29/2014 15:09:17 0:28:13
Stream 4 09/29/2014 14:41:04 09/29/2014 15:09:55 0:28:51
Stream 5 09/29/2014 14:41:04 09/29/2014 15:09:39 0:28:35
Stream 6 09/29/2014 14:41:04 09/29/2014 15:09:46 0:28:42
Stream 7 09/29/2014 14:41:04 09/29/2014 15:09:58 0:28:54
Stream 8 09/29/2014 14:41:04 09/29/2014 15:08:58 0:27:54
Refresh 0 09/29/2014 14:33:48 09/29/2014 14:34:07 0:00:19
09/29/2014 14:40:59 09/29/2014 14:41:04 0:00:05
Refresh 1 09/29/2014 15:06:57 09/29/2014 15:09:49 0:02:52
Refresh 2 09/29/2014 14:41:05 09/29/2014 14:47:39 0:06:34
Refresh 3 09/29/2014 14:47:40 09/29/2014 14:56:46 0:09:06
Refresh 4 09/29/2014 14:56:49 09/29/2014 15:03:19 0:06:30
Refresh 5 09/29/2014 15:03:24 09/29/2014 15:06:45 0:03:21
Refresh 6 09/29/2014 15:06:46 09/29/2014 15:06:49 0:00:03
Refresh 7 09/29/2014 15:06:50 09/29/2014 15:06:53 0:00:03
Refresh 8 09/29/2014 15:06:53 09/29/2014 15:10:04 0:03:11

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 9.393632 5.001910 17.053567 1.427500 17.813839 2.230451 13.884490 25.610995
Stream 1 12.971454 9.383520 94.257760 1.603106 127.940946 20.791892 78.869819 138.521273
Stream 2 21.428177 31.431513 96.366083 5.611843 58.394596 11.279502 47.114473 407.135077
Stream 3 23.377920 37.474814 83.640621 9.152178 71.186158 11.001543 46.763758 110.015662
Stream 4 49.580860 31.979940 87.662950 8.983661 68.052295 14.367631 59.266063 301.788652
Stream 5 13.483836 20.203772 391.980128 12.505446 77.966993 10.487869 52.989448 226.837637
Stream 6 38.104903 21.271630 84.689348 8.626460 86.620802 11.981171 69.182098 111.810485
Stream 7 20.243617 12.298692 99.547203 6.020951 151.584400 17.528287 62.037348 101.023802
Stream 8 22.808294 17.583072 59.180595 5.618565 123.108771 11.477376 42.485363 92.035709
Min Qi 12.971454 9.383520 59.180595 1.603106 58.394596 10.487869 42.485363 92.035709
Max Qi 49.580860 37.474814 391.980128 12.505446 151.584400 20.791892 78.869819 407.135077
Avg Qi 25.249883 22.703369 124.665586 7.265276 95.606870 13.614409 57.338546 186.146037
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 146.487681 6.798942 29.834475 3.177879 55.067866 4.503738 17.215591 9.333281
Stream 1 177.581204 44.178095 69.746005 12.306166 215.602727 30.443709 64.276384 45.266949
Stream 2 211.311651 27.403143 61.412478 12.173058 216.879170 18.272234 96.753886 35.587072
Stream 3 482.581456 68.663026 60.354163 13.408513 187.921639 17.469237 62.337222 31.706120
Stream 4 178.297373 23.711312 67.129677 15.216904 328.149575 20.258853 78.891201 84.852368
Stream 5 209.496498 28.346366 55.584081 9.644075 131.622351 24.171156 80.046801 43.625932
Stream 6 521.691639 24.126176 72.964805 15.311409 146.152570 34.748843 71.957130 58.470644
Stream 7 580.320149 17.054563 56.172396 7.530832 200.100326 12.444021 25.910599 75.653693
Stream 8 472.231674 15.064398 89.875570 42.394675 166.589234 12.831209 81.697881 73.821769
Min Qi 177.581204 15.064398 55.584081 7.530832 131.622351 12.444021 25.910599 31.706120
Max Qi 580.320149 68.663026 89.875570 42.394675 328.149575 34.748843 96.753886 84.852368
Avg Qi 354.188955 31.068385 66.654897 15.998204 199.127199 21.329908 70.233888 56.123068
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 12.252670 2.593733 4.115862 16.895672 10.183350 1.240096 18.679685 4.876067
Stream 1 356.740980 21.197870 30.422216 81.779038 65.468650 3.947503 63.933750 107.563796
Stream 2 54.087768 10.152604 34.940701 113.510640 70.908809 12.316233 109.091578 283.076004
Stream 3 52.807104 18.525982 13.740089 212.364908 16.413964 17.998809 58.653503 483.718271
Stream 4 42.389062 36.157809 28.909260 86.427025 21.605419 7.608729 54.910853 331.074114
Stream 5 48.214794 15.778893 20.681799 130.560005 43.846752 33.905533 54.536966 139.563667
Stream 6 84.061840 26.224851 16.546432 117.265210 34.766856 39.037423 0.710642 1.645351
Stream 7 63.034890 15.966686 31.666488 112.689765 28.661943 12.828171 1.274731 1.780452
Stream 8 43.879104 8.596666 32.585746 177.928730 26.763334 6.112333 1.187693 0.533668
Min Qi 42.389062 8.596666 13.740089 81.779038 16.413964 3.947503 0.710642 0.533668
Max Qi 356.740980 36.157809 34.940701 212.364908 70.908809 39.037423 109.091578 483.718271
Avg Qi 93.151943 19.075170 26.186591 129.065665 38.554466 16.719342 43.037465 168.619415

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09/30/2014 16:33 GMT Modified: 10/06/2014 13:57 GMT
In Hoc Signo Vinces (part 19 of n): Scalability, 1000G, and 3000G [ Orri Erling ]

Scalability, specifically linear scalability, means that twice the data takes twice as long to process, or that double the gear processes the same data in half the time. This is only literally true for "embarrassingly parallel" workloads.

There are parts of TPC-H which have an embarrassingly parallel nature, like Q1 and Q7. There are parts that are almost as easy, like Q14, Q17, Q19, and Q21, where there is a big scan and a selective hash join with a hash table small enough to replicate everywhere. The scan scales linearly; building the hash does not, since it is done at single-server speed (once in each process). Some queries like Q9 and Q13 end up doing a big cross-partition join which runs into communication overheads.
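The effect of a replicated hash build on scaling can be sketched with a two-term cost model. This is only an illustration: the function names and the 90 s / 10 s split are hypothetical, not measurements from these runs. The scan term is divided by the number of processes, while the build term is paid in full by every process and so does not shrink.

```python
def elapsed(scan_s, build_s, n_processes):
    """Wall time when the scan is split evenly but every process builds the full hash table."""
    return scan_s / n_processes + build_s


def speedup(scan_s, build_s, n_processes):
    """Speedup relative to a single process running the same plan."""
    return elapsed(scan_s, build_s, 1) / elapsed(scan_s, build_s, n_processes)


# Hypothetical single-process costs: 90 s of scan, 10 s of hash build.
for n in (1, 2, 4, 16):
    print(n, round(speedup(90.0, 10.0, n), 2))
```

With these made-up inputs the speedup flattens well below the process count, which is the shape of behavior described for the "almost as easy" queries.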

This is our first look at how performance behaves with bigger data and a larger platform. The results shown here are interesting but are not final. I bet I can do better; by how much is what we'll find out soon enough.

We will here compare a 1000G setup on my desktop, and a 3000G setup at the CWI's Scilens cluster. The former is 2 boxes of dual Xeon E5 2630, and the latter is 8 boxes of dual Xeon E5 2650v2. All things run from memory and both have QDR IB interconnect. Counting cores and clock, the CWI cluster is 6x larger.

As a rough approximation, for the worst queries, 6x the gear runs 3x the data in the same amount of real time. The 1000G setup has near full platform utilization and the 3000G setup has about half platform utilization. In both cases, running two instances of the same query at the same time takes twice as long.
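The 6x figure and the "half utilization" observation can be checked with simple arithmetic. The sketch below assumes 6 cores at 2.3 GHz per E5 2630 socket and 8 cores at 2.6 GHz per E5 2650v2 socket, as given in the part 17 platform comparison; the `capacity` helper is a name made up for this illustration.

```python
def capacity(boxes, sockets_per_box, cores_per_socket, ghz):
    """Crude platform size: total cores times clock."""
    return boxes * sockets_per_box * cores_per_socket * ghz


desktop = capacity(2, 2, 6, 2.3)   # 1000G setup
cwi = capacity(8, 2, 8, 2.6)       # 3000G setup

gear_ratio = cwi / desktop         # about 6
data_ratio = 3000 / 1000           # 3x the data
efficiency = data_ratio / gear_ratio

print(round(gear_ratio, 2), round(efficiency, 2))
```

Running 3x the data on about 6x the gear in the same time corresponds to roughly half the scaling one would get from perfectly linear behavior.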

We use Q9 for this study. The plan makes a hash table of part with 1/14 of all parts, replicating to all processes. Then there is a hash table of partsupp with a key of ps_partkey, ps_suppkey, and a dependent of ps_supplycost. This is much larger than the part hash table and is therefore partitioned on ps_partkey. The build is for 1/14th of partsupp. Then there is a scan of lineitem filtered by the part hash table; then a cross-partition join to the partsupp hash table; then a cross partition join to orders, this time by index; then a hash join on a replicated hash table of supplier; then nation; then aggregation. The aggregation is done in each slice; then the slices are added up at the end.

The plan could be improved by removing one partition crossing. As it stands, there is a crossing from l_orderkey to l_partkey and back to o_orderkey. This would not be so if the cost model knew that the probe into the partsupp hash table always hits. The cost model thinks it hits 1/14 of the time, because it does not know that the selection on the build is exactly the same as on the probe.

For the present purposes, the extra crossing just serves to make the matter of interest more visible.

So, for the 1000G setup, we have 43.6 seconds (s) and

Cluster 4 nodes, 44 s. 459 m/s 119788 KB/s  3120% cpu 0%  read 19% clw threads 1r 0w 0i buffers 17622126 68 d 0 w 0 pfs
 

For the 3000G setup, we have 49.9 s and

Cluster 16 nodes, 50 s. 49389 m/s 1801815 KB/s  7283% cpu 0%  read 18% clw threads 1r 0w 0i buffers 135122893 15895255 d 0 w 17 pfs
 

The platform utilization on the small system is better, at 31/48 (running/total threads); the large one has 73/256.

The large case is clearly network bound. If CPU were the only constraint, the large system should finish 3000G in half the time it takes the small system to do 1000G.

We confirm this by looking at write wait: 3940 seconds of thread time blocked on write over 50s of real time. The figures on the small one are 3.9s of thread time blocked for 39s of real time. The data transfer on the large one is 93 GB.
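The utilization and write-wait figures quoted above follow from the status lines with a little arithmetic. The sketch below uses only numbers from the text (CPU percentages, thread counts of 48 and 256, blocked thread time, and elapsed times); the variable names are made up for illustration.

```python
# CPU% in the status line is summed over threads, so 100% is one busy thread.
util_small = (3120 / 100) / 48     # about 31 of 48 threads running
util_large = (7283 / 100) / 256    # about 73 of 256 threads running

# Average number of threads blocked on write at any moment.
blocked_small = 3.9 / 39           # thread-seconds blocked / real seconds
blocked_large = 3940 / 50

# If CPU were the only limit: 3x the data on about 6x the gear.
cpu_only_estimate_s = 43.6 * 3 / 6

print(round(util_small, 2), round(util_large, 2))
print(round(blocked_small, 1), round(blocked_large, 1))
print(round(cpu_only_estimate_s, 1))
```

On average nearly 79 threads are waiting on a write on the large system, against a fraction of one thread on the small one, and the measured 49.9 s is more than twice the CPU-only estimate.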

How to block less? One idea would be to write less. So we try compression; there is a Google snappy-based message compression option in Virtuoso.

We now get 39.6 s and

Cluster 16 nodes, 40 s. 65161 m/s 1239922 KB/s  10201% cpu 0%  read 21% clw threads 1r 0w 0i buffers 52828440 172 d 0 w 0 pfs
 

The write block time is 397 s of thread time over 39 s of real time, 10x better. The data transfer is 50.9 GB after compression. Snappy is somewhat effective for compression and very fast; in CPU profile, it is under 3% of Q9 on the small system. Gains on the small system are less, though, since blocking is not a big issue to start with.
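The before and after figures for message compression can be put side by side. This is plain arithmetic on the numbers quoted in the text for the 3000G setup; the variable names are illustrative only.

```python
# 3000G Q9, without and with snappy-based message compression.
transfer_ratio = 93 / 50.9        # GB sent before / after
block_ratio = 3940 / 397          # thread-seconds blocked on write, before / after
elapsed_ratio = 49.9 / 39.6       # query time before / after
blocked_after = 397 / 39          # average threads blocked on write with compression

print(round(transfer_ratio, 2), round(block_ratio, 1),
      round(elapsed_ratio, 2), round(blocked_after, 1))
```

Less than a 2x reduction in bytes gives about a 10x reduction in write blocking, but only about a 1.26x gain in elapsed time, since around 10 threads are still blocked on average.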

This is still not full platform. But if the data transfer is further cut in half by a better plan, the situation will be quite good. Now we have 102/256 threads running, meaning that there could be another 40-50% of throughput to be added. The last 128 threads are second threads of a core, so count for roughly 30% of a real core.

The main cluster-specific operation is a send from one to many. This is now done by formulating the message to each recipient in a chain of string buffers; then, after all the messages are prepared, these are optionally compressed and sent to their recipients. This is needlessly simple. Compression could instead proceed whenever a write would block. Once all compression is done, a blocked write should switch to another recipient; only when all recipients are in a would-block situation should the thread call select with all descriptors and block on them collectively. There is a piece of code to this effect, but it is not currently in use. It has been seen to add no value in small cases, but could be useful here.
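The improved send loop described above can be sketched in a few lines. This is not Virtuoso's code: it is a minimal Python illustration using non-blocking sockets, with zlib standing in for snappy, and `send_to_many` and `loopback_demo` are hypothetical names. It assumes one message per recipient.

```python
import select
import socket
import threading
import zlib


def send_to_many(socks, messages, compress=zlib.compress):
    """Send messages[i] on socks[i], blocking only when every pending write would block."""
    for s in socks:
        s.setblocking(False)
    uncompressed = list(zip(socks, messages))  # prepared, not yet compressed
    pending = {}                               # socket -> bytes still to write
    while uncompressed or pending:
        progressed = False
        for s in list(pending):
            try:
                sent = s.send(pending[s])
            except BlockingIOError:
                continue                       # would block: try the next recipient
            progressed = True
            pending[s] = pending[s][sent:]
            if len(pending[s]) == 0:
                del pending[s]
        if uncompressed:
            # Instead of waiting on a blocked write, compress the next message.
            s, data = uncompressed.pop(0)
            pending[s] = memoryview(compress(data))
        elif pending and not progressed:
            # Compression is done and every recipient would block:
            # wait on all descriptors collectively.
            select.select([], list(pending), [])


def loopback_demo(n_recipients=3, size=1_000_000):
    """Round-trip distinct payloads through local socket pairs; True on success."""
    pairs = [socket.socketpair() for _ in range(n_recipients)]
    messages = [bytes([i]) * size + bytes(range(256)) * 100
                for i in range(n_recipients)]
    received = [bytearray() for _ in pairs]

    def drain(sock, buf):
        while chunk := sock.recv(65536):
            buf.extend(chunk)

    readers = [threading.Thread(target=drain, args=(r, buf))
               for (_, r), buf in zip(pairs, received)]
    for t in readers:
        t.start()
    send_to_many([w for w, _ in pairs], messages)
    for w, _ in pairs:
        w.close()                              # EOF lets each reader finish
    for t in readers:
        t.join()
    for _, r in pairs:
        r.close()
    return [zlib.decompress(bytes(b)) for b in received] == messages
```

Calling `loopback_demo()` exercises the loop against local socket pairs. The would-block path only matters when compressed messages are large relative to the socket buffers, which matches the observation that this adds no value in small cases.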

The IB fabric has been seen to do 1.8 GB/s bidirectionally on multiple independent point-to-point TCP links. This is about half the nominal 4 GB/s (40 Gbit/s signaling with 8b/10b encoding). So the aggregate throughputs that we see here are nowhere near the nominal spec of the network. Lower level interfaces and the occasional busy wait on the reading end could be tried to some advantage. We have not tried 10GbE either; but if that works at nominal speed, then 10GbE should also be good enough. We will try this at Amazon in due time.
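The nominal figure comes from the line encoding: QDR InfiniBand signals at 40 Gbit/s, and 8b/10b encoding carries 8 data bits in every 10 signal bits. A quick check of the numbers, with illustrative variable names:

```python
signal_gbit_s = 40
data_gbit_s = signal_gbit_s * 8 / 10     # 8b/10b: 8 data bits per 10 signal bits
nominal_gbyte_s = data_gbit_s / 8        # bits to bytes

observed_gbyte_s = 1.8                   # measured on point-to-point TCP links
fraction_of_nominal = observed_gbyte_s / nominal_gbyte_s

print(nominal_gbyte_s, round(fraction_of_nominal, 2))
```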

In the meantime, there is a 3000G test made at the CWI cluster without message compression. The score is about 4x that of the single server at 300G using the same hardware. The run is with approximately half platform utilization. There are three runs of power plus throughput, the first run being cold.

Run Power Throughput Composite
Run 1 305,881.5 1,072,411.9 572,739.8
Run 2 1,292,085.1 1,179,391.6 1,234,453.1
Run 3 1,178,534.1 1,092,936.2 1,134,928.4
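As a sanity check on the table, the composite in TPC-H style reporting is the geometric mean of power and throughput, and throughput is derived from the measurement interval Ts as streams x 22 queries x 3600 / Ts x scale factor. The sketch below recomputes both from the figures in this article (Ts values are taken from the executive summaries of the three runs); the helper names are made up for illustration.

```python
import math

# (power, throughput, reported composite, Ts in seconds) for the three 3000G runs.
runs = {
    1: (305881.5, 1072411.9, 572739.8, 1772.554),
    2: (1292085.1, 1179391.6, 1234453.1, 1611.779),
    3: (1178534.1, 1092936.2, 1134928.4, 1739.269),
}


def composite(power, throughput):
    """Geometric mean of the power and throughput metrics."""
    return math.sqrt(power * throughput)


def throughput_from_ts(streams, ts_seconds, scale_factor, queries=22):
    """Queries per hour across all streams, scaled by the scale factor."""
    return streams * queries * 3600 / ts_seconds * scale_factor


for run, (p, t, reported, ts) in runs.items():
    print(run, round(composite(p, t), 1), reported,
          round(throughput_from_ts(8, ts, 3000)), t)
```

The recomputed composites agree with the table to within rounding, and the recomputed throughputs agree to within about 0.01%. The geometric mean also shows why the cold first power test drags the run 1 composite down to about half that of the warm runs even though its throughput is similar.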

The numerical quantities summaries follow. One problem of the run is a high peak of query memory consumption leading to slowdown. Some parts should probably be done in multiple passes to keep the peak lower and not run into swapping. The details will have to be sorted out. This is a demonstration of capability; the perfected accomplishment is to follow.

3000G Run 1

Virt-H Executive Summary

Report Date September 29, 2014
Database Scale Factor 3000
Query Streams for Throughput Test 8
Virt-H Power 305,881.5
Virt-H Throughput 1,072,411.9
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 572,739.8
Measurement Interval in Throughput Test (Ts) 1,772.554000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 09/29/2014 12:54:52 09/29/2014 13:31:17 0:36:25
Stream 1 09/29/2014 13:31:24 09/29/2014 13:59:24 0:28:00
Stream 2 09/29/2014 13:31:24 09/29/2014 13:58:59 0:27:35
Stream 3 09/29/2014 13:31:24 09/29/2014 13:58:29 0:27:05
Stream 4 09/29/2014 13:31:24 09/29/2014 13:58:52 0:27:28
Stream 5 09/29/2014 13:31:24 09/29/2014 14:00:06 0:28:42
Stream 6 09/29/2014 13:31:24 09/29/2014 13:58:18 0:26:54
Stream 7 09/29/2014 13:31:24 09/29/2014 13:59:25 0:28:01
Stream 8 09/29/2014 13:31:24 09/29/2014 13:58:50 0:27:26
Refresh 0 09/29/2014 12:54:52 09/29/2014 12:56:59 0:02:07
09/29/2014 13:31:17 09/29/2014 13:31:23 0:00:06
Refresh 1 09/29/2014 14:00:38 09/29/2014 14:01:11 0:00:33
Refresh 2 09/29/2014 13:31:25 09/29/2014 13:36:57 0:05:32
Refresh 3 09/29/2014 13:36:56 09/29/2014 13:47:02 0:10:06
Refresh 4 09/29/2014 13:47:03 09/29/2014 13:51:40 0:04:37
Refresh 5 09/29/2014 13:51:42 09/29/2014 13:56:40 0:04:58
Refresh 6 09/29/2014 13:56:40 09/29/2014 13:59:25 0:02:45
Refresh 7 09/29/2014 13:59:25 09/29/2014 14:00:10 0:00:45
Refresh 8 09/29/2014 14:00:11 09/29/2014 14:00:37 0:00:26

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 601.576975 90.803782 108.725110 177.112667 171.995572 2.098138 15.768311 152.511444
Stream 1 13.310341 32.722946 125.551415 1.912836 46.041675 13.294214 85.345068 165.424288
Stream 2 19.425885 9.248670 150.855556 7.085737 88.445566 10.490432 49.318554 322.500839
Stream 3 30.534391 14.273478 100.987791 59.341763 46.442443 9.613795 64.186196 146.324186
Stream 4 28.211213 37.134522 64.189335 10.931513 100.610673 9.929866 112.270530 108.489951
Stream 5 29.226411 18.132589 95.245160 63.100068 115.663908 6.151231 46.251309 127.742471
Stream 6 30.750930 20.888658 108.894177 55.168565 82.016828 69.451493 65.161517 103.697733
Stream 7 13.462570 18.033847 32.065492 78.910373 202.998301 10.688279 47.167022 139.601948
Stream 8 24.354314 16.711503 112.008551 8.307098 126.849630 7.127605 51.083118 98.648077
Min Qi 13.310341 9.248670 32.065492 1.912836 46.041675 6.151231 46.251309 98.648077
Max Qi 30.750930 37.134522 150.855556 78.910373 202.998301 69.451493 112.270530 322.500839
Avg Qi 23.659507 20.893277 98.724685 35.594744 101.133628 17.093364 65.097914 151.553687
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 92.991259 5.175922 42.238393 29.239879 367.805534 3.604910 15.557396 11.650267
Stream 1 149.502128 30.197806 50.786184 217.190836 283.545905 11.653171 73.321150 116.860455
Stream 2 245.783668 22.278841 50.578731 36.301810 181.405269 32.236754 57.631764 61.540533
Stream 3 377.782738 24.129319 84.097657 10.959661 171.698669 8.973519 54.532180 45.527142
Stream 4 341.148908 74.358770 85.782399 43.116347 151.146233 22.870727 74.439693 51.871535
Stream 5 72.259919 11.424035 79.310504 9.833135 562.871920 14.961209 127.861874 55.377721
Stream 6 373.301225 41.379753 81.983260 9.373200 95.039317 19.071346 76.159452 48.324504
Stream 7 449.871952 16.099152 48.047940 8.559784 211.094730 10.569071 26.710228 72.571454
Stream 8 395.771006 33.537585 54.850876 141.526389 153.763316 12.997092 127.961975 57.100346
Min Qi 72.259919 11.424035 48.047940 8.559784 95.039317 8.973519 26.710228 45.527142
Max Qi 449.871952 74.358770 85.782399 217.190836 562.871920 32.236754 127.961975 116.860455
Avg Qi 300.677693 31.675658 66.929694 59.607645 226.320670 16.666611 77.327289 63.646711
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 12.230334 70.991261 33.092797 17.517230 15.798438 19.743562 127.494687 5.893471
Stream 1 27.550293 14.970857 16.442806 111.138612 68.214095 7.884782 27.109441 6.087067
Stream 2 43.277918 12.748690 22.681844 92.835566 84.416610 14.661934 151.094498 153.285076
Stream 3 129.696125 13.435663 14.674499 129.179966 39.176513 6.286296 181.596838 416.052710
Stream 4 110.348816 7.080225 21.051910 85.758973 65.130356 7.292999 123.386514 151.000786
Stream 5 43.365006 9.847612 32.881770 94.752284 67.788314 9.035439 72.539334 223.967821
Stream 6 34.534280 36.347298 27.849276 122.736244 51.447492 25.051058 80.452175 84.519426
Stream 7 48.021860 30.594474 22.522426 99.245893 73.076698 7.260729 38.585852 5.697277
Stream 8 29.484201 12.368769 40.344043 84.137820 30.813313 4.856991 22.196547 4.600057
Min Qi 27.550293 7.080225 14.674499 84.137820 30.813313 4.856991 22.196547 4.600057
Max Qi 129.696125 36.347298 40.344043 129.179966 84.416610 25.051058 181.596838 416.052710
Avg Qi 58.284812 17.174198 24.806072 102.473170 60.007924 10.291279 87.120150 130.651277

3000G Run 2

Virt-H Executive Summary

Report Date September 29, 2014
Database Scale Factor 3000
Query Streams for Throughput Test 8
Virt-H Power 1292085.1
Virt-H Throughput 1179391.6
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 1234453.1
Measurement Interval in Throughput Test (Ts) 1611.779000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 09/29/2014 14:01:15 09/29/2014 14:06:48 0:05:33
Stream 1 09/29/2014 14:06:53 09/29/2014 14:30:22 0:23:29
Stream 2 09/29/2014 14:06:53 09/29/2014 14:32:30 0:25:37
Stream 3 09/29/2014 14:06:53 09/29/2014 14:31:23 0:24:30
Stream 4 09/29/2014 14:06:53 09/29/2014 14:31:34 0:24:41
Stream 5 09/29/2014 14:06:53 09/29/2014 14:32:53 0:26:00
Stream 6 09/29/2014 14:06:53 09/29/2014 14:29:51 0:22:58
Stream 7 09/29/2014 14:06:53 09/29/2014 14:31:34 0:24:41
Stream 8 09/29/2014 14:06:53 09/29/2014 14:30:35 0:23:42
Refresh 0 09/29/2014 14:01:15 09/29/2014 14:01:35 0:00:20
09/29/2014 14:06:49 09/29/2014 14:06:53 0:00:04
Refresh 1 09/29/2014 14:33:16 09/29/2014 14:33:45 0:00:29
Refresh 2 09/29/2014 14:06:55 09/29/2014 14:12:28 0:05:33
Refresh 3 09/29/2014 14:12:29 09/29/2014 14:21:55 0:09:26
Refresh 4 09/29/2014 14:21:55 09/29/2014 14:27:40 0:05:45
Refresh 5 09/29/2014 14:27:43 09/29/2014 14:31:14 0:03:31
Refresh 6 09/29/2014 14:31:14 09/29/2014 14:31:51 0:00:37
Refresh 7 09/29/2014 14:31:51 09/29/2014 14:32:52 0:01:01
Refresh 8 09/29/2014 14:32:52 09/29/2014 14:33:16 0:00:24

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 9.451169 3.644118 18.419151 1.404395 15.740525 2.085038 15.171847 25.400834
Stream 1 19.558041 6.607300 85.774410 4.503525 81.448472 11.976129 92.140470 145.743853
Stream 2 31.042019 7.877299 71.958033 8.862111 142.452144 18.489193 81.003310 85.856529
Stream 3 38.833612 12.440326 86.063103 7.165120 84.707025 16.931531 100.442710 122.411252
Stream 4 15.751913 33.026762 50.457193 7.064220 114.130257 5.992556 66.035959 84.596973
Stream 5 18.462884 28.047942 110.690543 16.566547 104.403789 5.303453 72.552640 402.383383
Stream 6 17.858339 33.988800 110.431091 7.238431 72.229953 16.850955 68.231546 180.601000
Stream 7 23.055572 17.044813 96.105520 8.941132 171.130879 8.423100 70.634541 147.261648
Stream 8 19.840798 13.860740 74.961175 16.171566 56.165875 5.904921 47.646217 125.991819
Min Qi 15.751913 6.607300 50.457193 4.503525 56.165875 5.303453 47.646217 84.596973
Max Qi 38.833612 33.988800 110.690543 16.566547 171.130879 18.489193 100.442710 402.383383
Avg Qi 23.050397 19.111748 85.805134 9.564082 103.333549 11.233980 74.835924 161.855807
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 54.766945 5.551163 29.216632 3.035008 52.816902 3.346243 15.767022 10.066112
Stream 1 130.666380 9.658277 49.332720 103.036705 194.520370 12.166344 65.144599 97.158571
Stream 2 254.754936 22.605298 38.102466 21.121168 300.467330 12.262318 108.203491 50.696657
Stream 3 283.761567 19.327164 73.414574 7.431651 183.121904 12.573854 73.814766 46.802493
Stream 4 290.341947 57.452026 58.354221 13.066162 189.263163 18.998781 121.269774 54.831406
Stream 5 81.787025 8.410538 79.822552 16.005077 190.730342 21.697136 100.456487 46.744884
Stream 6 202.558515 39.360009 74.519981 15.960756 137.321631 26.583824 57.537668 60.758997
Stream 7 226.790801 44.175536 73.992368 7.561897 182.853851 17.597471 31.128055 44.389893
Stream 8 275.423934 21.980040 60.538239 39.736622 173.574795 58.786316 95.124912 25.564108
Min Qi 81.787025 8.410538 38.102466 7.431651 137.321631 12.166344 31.128055 25.564108
Max Qi 290.341947 57.452026 79.822552 103.036705 300.467330 58.786316 121.269774 97.158571
Avg Qi 218.260638 27.871111 63.509640 27.990005 193.981673 22.583255 81.584969 53.368376
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 13.620157 2.288504 4.166807 16.468447 9.991810 1.101775 20.152227 4.294680
Stream 1 44.026143 31.720525 25.684461 134.254716 30.797008 9.568594 24.328205 4.319533
Stream 2 40.283148 9.970277 29.731019 133.083785 29.322194 8.859556 73.251098 249.850045
Stream 3 44.288244 18.914661 38.162762 144.458624 22.556235 6.184842 117.267234 445.700238
Stream 4 67.147744 6.649451 27.876825 59.226248 69.373248 44.478703 61.381724 282.608075
Stream 5 36.403227 12.226129 21.997683 95.912670 44.219799 21.117974 106.473817 97.896971
Stream 6 42.114038 30.805969 25.929027 51.658733 26.475662 34.816500 31.309953 5.608395
Stream 7 48.601889 18.708127 18.893532 132.558026 50.476383 12.309402 22.661371 37.610815
Stream 8 34.413417 34.709883 37.058335 121.710608 44.676485 9.449332 19.311945 4.420232
Min Qi 34.413417 6.649451 18.893532 51.658733 22.556235 6.184842 19.311945 4.319533
Max Qi 67.147744 34.709883 38.162762 144.458624 69.373248 44.478703 117.267234 445.700238
Avg Qi 44.659731 20.463128 28.166705 109.107926 39.737127 18.348113 56.998168 141.001788

3000G Run 3

Virt-H Executive Summary

Report Date September 29, 2014
Database Scale Factor 3000
Query Streams for Throughput Test 8
Virt-H Power 1178534.1
Virt-H Throughput 1092936.2
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 1134928.4
Measurement Interval in Throughput Test (Ts) 1739.269000 seconds

Duration of stream execution

Start Date/Time End Date/Time Duration
Stream 0 09/29/2014 14:33:48 09/29/2014 14:40:59 0:07:11
Stream 1 09/29/2014 14:41:04 09/29/2014 15:10:02 0:28:58
Stream 2 09/29/2014 14:41:04 09/29/2014 15:09:07 0:28:03
Stream 3 09/29/2014 14:41:04 09/29/2014 15:09:17 0:28:13
Stream 4 09/29/2014 14:41:04 09/29/2014 15:09:55 0:28:51
Stream 5 09/29/2014 14:41:04 09/29/2014 15:09:39 0:28:35
Stream 6 09/29/2014 14:41:04 09/29/2014 15:09:46 0:28:42
Stream 7 09/29/2014 14:41:04 09/29/2014 15:09:58 0:28:54
Stream 8 09/29/2014 14:41:04 09/29/2014 15:08:58 0:27:54
Refresh 0 09/29/2014 14:33:48 09/29/2014 14:34:07 0:00:19
09/29/2014 14:40:59 09/29/2014 14:41:04 0:00:05
Refresh 1 09/29/2014 15:06:57 09/29/2014 15:09:49 0:02:52
Refresh 2 09/29/2014 14:41:05 09/29/2014 14:47:39 0:06:34
Refresh 3 09/29/2014 14:47:40 09/29/2014 14:56:46 0:09:06
Refresh 4 09/29/2014 14:56:49 09/29/2014 15:03:19 0:06:30
Refresh 5 09/29/2014 15:03:24 09/29/2014 15:06:45 0:03:21
Refresh 6 09/29/2014 15:06:46 09/29/2014 15:06:49 0:00:03
Refresh 7 09/29/2014 15:06:50 09/29/2014 15:06:53 0:00:03
Refresh 8 09/29/2014 15:06:53 09/29/2014 15:10:04 0:03:11

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 9.393632 5.001910 17.053567 1.427500 17.813839 2.230451 13.884490 25.610995
Stream 1 12.971454 9.383520 94.257760 1.603106 127.940946 20.791892 78.869819 138.521273
Stream 2 21.428177 31.431513 96.366083 5.611843 58.394596 11.279502 47.114473 407.135077
Stream 3 23.377920 37.474814 83.640621 9.152178 71.186158 11.001543 46.763758 110.015662
Stream 4 49.580860 31.979940 87.662950 8.983661 68.052295 14.367631 59.266063 301.788652
Stream 5 13.483836 20.203772 391.980128 12.505446 77.966993 10.487869 52.989448 226.837637
Stream 6 38.104903 21.271630 84.689348 8.626460 86.620802 11.981171 69.182098 111.810485
Stream 7 20.243617 12.298692 99.547203 6.020951 151.584400 17.528287 62.037348 101.023802
Stream 8 22.808294 17.583072 59.180595 5.618565 123.108771 11.477376 42.485363 92.035709
Min Qi 12.971454 9.383520 59.180595 1.603106 58.394596 10.487869 42.485363 92.035709
Max Qi 49.580860 37.474814 391.980128 12.505446 151.584400 20.791892 78.869819 407.135077
Avg Qi 25.249883 22.703369 124.665586 7.265276 95.606870 13.614409 57.338546 186.146037
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 146.487681 6.798942 29.834475 3.177879 55.067866 4.503738 17.215591 9.333281
Stream 1 177.581204 44.178095 69.746005 12.306166 215.602727 30.443709 64.276384 45.266949
Stream 2 211.311651 27.403143 61.412478 12.173058 216.879170 18.272234 96.753886 35.587072
Stream 3 482.581456 68.663026 60.354163 13.408513 187.921639 17.469237 62.337222 31.706120
Stream 4 178.297373 23.711312 67.129677 15.216904 328.149575 20.258853 78.891201 84.852368
Stream 5 209.496498 28.346366 55.584081 9.644075 131.622351 24.171156 80.046801 43.625932
Stream 6 521.691639 24.126176 72.964805 15.311409 146.152570 34.748843 71.957130 58.470644
Stream 7 580.320149 17.054563 56.172396 7.530832 200.100326 12.444021 25.910599 75.653693
Stream 8 472.231674 15.064398 89.875570 42.394675 166.589234 12.831209 81.697881 73.821769
Min Qi 177.581204 15.064398 55.584081 7.530832 131.622351 12.444021 25.910599 31.706120
Max Qi 580.320149 68.663026 89.875570 42.394675 328.149575 34.748843 96.753886 84.852368
Avg Qi 354.188955 31.068385 66.654897 15.998204 199.127199 21.329908 70.233888 56.123068
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 12.252670 2.593733 4.115862 16.895672 10.183350 1.240096 18.679685 4.876067
Stream 1 356.740980 21.197870 30.422216 81.779038 65.468650 3.947503 63.933750 107.563796
Stream 2 54.087768 10.152604 34.940701 113.510640 70.908809 12.316233 109.091578 283.076004
Stream 3 52.807104 18.525982 13.740089 212.364908 16.413964 17.998809 58.653503 483.718271
Stream 4 42.389062 36.157809 28.909260 86.427025 21.605419 7.608729 54.910853 331.074114
Stream 5 48.214794 15.778893 20.681799 130.560005 43.846752 33.905533 54.536966 139.563667
Stream 6 84.061840 26.224851 16.546432 117.265210 34.766856 39.037423 0.710642 1.645351
Stream 7 63.034890 15.966686 31.666488 112.689765 28.661943 12.828171 1.274731 1.780452
Stream 8 43.879104 8.596666 32.585746 177.928730 26.763334 6.112333 1.187693 0.533668
Min Qi 42.389062 8.596666 13.740089 81.779038 16.413964 3.947503 0.710642 0.533668
Max Qi 356.740980 36.157809 34.940701 212.364908 70.908809 39.037423 109.091578 483.718271
Avg Qi 93.151943 19.075170 26.186591 129.065665 38.554466 16.719342 43.037465 168.619415

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09/30/2014 16:33 GMT Modified: 10/06/2014 13:55 GMT
In Hoc Signo Vinces (part 18 of n): Cluster Dynamics [ Virtuso Data Space Bot ]

This article is about how scale-out differs from single-server. This shows large effects of parameters whose very existence most would not anticipate, and some low level metrics for assessing these. The moral of the story is that this is the stuff which makes the difference between merely surviving scale-out and winning with it. The developer and DBA would not normally know about this; thus these things fall into the category of adaptive self-configuration expected from the DBMS. But since this series is about what makes performance, I will discuss the dynamics such as they are and how to play these.

We take the prototypical cross partition join in Q13: Make a hash table of all customers, partitioned by c_custkey. This is independently done with full parallelism in each partition. Scan the orders, get the customer (in a different partition), and flag the customers that had at least one order. Then, to get the customers with no orders, return the customers that were not flagged in the previous pass.

The single-server time in part 12 was 7.8 s and 6.0 s with a single user. We consider the better of the times. The difference is due to allocating memory on the first go; on the second go the memory is already in reserve.

With default settings, we get 4595 ms (milliseconds), with per-node resource utilization at:


Cluster 4 nodes, 4 s. 112405 m/s 742602 KB/s  2749% cpu 0%  read 4% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 27867 m/s 185654 KB/s  733% cpu 0%  read 4% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 28149 m/s 185372 KB/s  672% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 28220 m/s 185621 KB/s  675% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 28150 m/s 185837 KB/s  667% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs

 

The top line is the summary; the lines below are per-process. The m/s is messages-per-second; KB/s is interconnect traffic per second; clw % is idle time spent waiting for a reply from another process. The cluster is set up with 4 processes across 2 machines, each with 2 NUMA nodes. Each process has affinity to the NUMA node, so local memory only. The time is reasonable in light of the overall CPU of 2700%. The maximum would be 4800% with all threads of all cores busy all the time.

The catch here is that we do not have a steady half-platform utilization all the time, but full platform peaks followed by synchronization barriers with very low utilization. So, we set the batch size differently:


cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 50000000)');

 

This means that we set, on each process, the cl_dfg_batch_bytes to 50M from a default of 10M. The effect is that each scan of orders, one thread per slice, 48 slices total, will produce 50MB worth of o_custkeys to be sent to the other partition for getting the customer. After each 50M, the thread stops and will produce the next batch when all are done and a global continue message is sent by the coordinator.

The time is now 3173 ms with:


Cluster 4 nodes, 3 s. 158220 m/s 1054944 KB/s  3676% cpu 0%  read 1% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 39594 m/s 263962 KB/s  947% cpu 0%  read 1% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 39531 m/s 263476 KB/s  894% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 39523 m/s 263684 KB/s  933% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 39535 m/s 263586 KB/s  900% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs

 

The platform utilization is better as we see. The throughput is nearly double that of the single-server, which is pretty good for a communication-heavy query.
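The two status lines can be compared numerically. The sketch below uses the CPU percentages and elapsed times quoted above, the 4800% maximum, and the better single-server time of 6.0 from part 12, which is assumed here to be in seconds; variable names are illustrative.

```python
MAX_CPU_PCT = 4800                   # all threads of all cores busy

util_default = 2749 / MAX_CPU_PCT    # cl_dfg_batch_bytes at the 10M default
util_tuned = 3676 / MAX_CPU_PCT      # cl_dfg_batch_bytes at 50M

gain_from_batch = 4595 / 3173        # elapsed ms, default / tuned
gain_vs_single = 6000 / 3173         # single-server ms (assumed 6.0 s) / cluster ms

print(round(util_default, 2), round(util_tuned, 2))
print(round(gain_from_batch, 2), round(gain_vs_single, 2))
```

Utilization rises from about 57% to about 77% of the platform, and the cluster time is about 1.9x better than the single server.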

This was done with a vector size of 10K. In other words, each partition gets 10K o_custkeys and splits these 48 ways to go to every recipient. 1/4 are in the same process, 1/4 in a different process on the same machine, and 2/4 on a different machine. The recipient gets messages with an average of 208 o_custkey values, puts them back together in batches of 10K, and passes these to the hash join with customer.
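The message size and locality fractions above follow from the topology. The sketch below assumes the 48 slices are spread evenly, 12 per process, over 2 machines with 2 processes each; `split_stats` is a hypothetical helper, not a Virtuoso function.

```python
def split_stats(vector_size, machines=2, procs_per_machine=2, slices_per_proc=12):
    """Average message size and destination locality for one vector split across all slices."""
    slices = machines * procs_per_machine * slices_per_proc
    return {
        "slices": slices,
        "avg_values_per_message": vector_size / slices,
        "same_process": slices_per_proc / slices,
        "same_machine_other_process": (procs_per_machine - 1) * slices_per_proc / slices,
        "other_machine": (machines - 1) * procs_per_machine * slices_per_proc / slices,
    }


for size in (10_000, 100_000, 1_000_000):
    print(size, round(split_stats(size)["avg_values_per_message"]))
```

With the same topology, a larger vector size makes each message proportionally longer, which is the trade-off explored next.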

We try different vector sizes, such as 100K:


cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');

 

There are two metrics of interest here: the write block time and the scheduling overhead. The write block time is measured in microseconds and increases whenever a thread must wait before it can write to a connection. The scheduling overhead is the cumulative clocks spent by threads waiting for a critical section that deals with dispatching messages to consumer threads. Long messages cause blocking on write; short messages cause frequent scheduling decisions.


SELECT cl_sys_stat ('local_cll_clk', clr=>1), 
       cl_sys_stat ('write_block_usec', clr=>1)
;

 

cl_sys_stat gets the counters from all processes and returns the sum. clr=>1 means that the counter is cleared after read.

We do Q13 with vector sizes of 10K, 100K, and 1000K.

Vector size msec mtx (clocks) wblock (usec)
10K 3297 10,829,910,329 0
100K 3150 1,663,238,367 59,132
1000K 3876 414,631,129 4,578,003

So, 100K seems to strike the best balance between scheduling and blocking on write.
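To compare the two counters on one scale, both can be converted to seconds of thread time. The sketch below assumes the mtx counter is in CPU clock cycles at the nominal 2.3 GHz of the Xeon E5 2630 (an assumption; the actual clock source of the counter is not stated here), and that wblock is in microseconds.

```python
CLOCK_HZ = 2.3e9  # assumption: nominal clock of the E5 2630

# (vector size, elapsed msec, mtx clocks, wblock usec) from the table above.
rows = [
    (10_000, 3297, 10_829_910_329, 0),
    (100_000, 3150, 1_663_238_367, 59_132),
    (1_000_000, 3876, 414_631_129, 4_578_003),
]


def overhead_seconds(mtx_clocks, wblock_usec, clock_hz=CLOCK_HZ):
    """Cumulative thread time lost to scheduling waits plus write blocking."""
    return mtx_clocks / clock_hz + wblock_usec / 1e6


fastest = min(rows, key=lambda r: r[1])[0]
least_overhead = min(rows, key=lambda r: overhead_seconds(r[2], r[3]))[0]

for size, msec, mtx, wblock in rows:
    print(size, msec, round(overhead_seconds(mtx, wblock), 2))
```

Under this assumption the 10K and 1000K settings each lose several seconds of cumulative thread time, to scheduling in one case and to write blocking in the other, while 100K loses under one second and is also the fastest.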

The times are measured after several samples with each setting. The times stabilize after a few runs, as the appropriate size memory blocks are in reserve. Calling mmap to allocate these on the first run with each size has a very high penalty, e.g., 60 s for the first run with 1000K vector size. We note that blocking on write is really bad even though 1/3 of the messages written to a connection stay on the same machine with no network involved, and the other 2/3 go over a fast network (QDR IB) with no other load. Further, the affinities are set so that the thread responsible for incoming messages is always on core. Result variability on consecutive runs is under 5%, which is similar to single-server behavior.

It would seem that a mutex, as bad as it is, is still better than a distributed cause for going off core (blocking on write). The latency for continuing a thread thus blocked is of course higher than the latency for continuing one that is waiting for a mutex.

We note that a cluster with more machines can take a longer vector size because a vector spreads out to more recipients. The key seems to be to set the message size so that blocking on write is not common. This is a possible adaptive execution feature. We have seen no particular benefit from SDP (Sockets Direct Protocol) and its zero copy. This is a TCP replacement that comes with the InfiniBand drivers.

We will next look at replication/partitioning tradeoffs for hash joins. Then we can look at full runs.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09/26/2014 17:07 GMT Modified: 10/06/2014 13:57 GMT
In Hoc Signo Vinces (part 17 of n): 100G and 300G Runs on Dual Xeon E5 2650v2 [ Virtuso Data Space Bot ]

This is an update presenting sample results on a newer platform for a single-server configuration. This is to verify that performance scales with the addition of cores and clock speed. Further, we note that the jump from 100G to 300G changes very little about the score. 3x larger takes approximately 3x longer, as long as things are in memory.

The platform is one node of the CWI cluster which was also used for the 500Gt RDF experiments reported on this blog. The specification is dual Xeon E5 2650v2 (8 core, 16 thread, 2.6 GHz) with 256 GB RAM. The disk setup is a RAID-0 of three 2 TB rotating disks.

For the 100G, the composite score goes from about 240K to about 395K, which is about 1.64x. The new platform has 16 vs 12 cores and a clock of 2.6 GHz as opposed to 2.3 GHz. Cores times clock makes a multiplier of 1.5. The rest of the acceleration is probably attributable to a faster memory clock. In any case, the point that a larger platform gives more speed is made.
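
The scaling arithmetic above can be checked with a few lines. This is only a sketch: it assumes the scores 240 and 395 are composite scores in thousands and the clocks are in GHz, and the variable names are illustrative.

```python
# Rough scaling check for the figures quoted in the text.
# Assumption: scores are composite scores in thousands, clocks are in GHz.
old_score, new_score = 240.0, 395.0
old_cores, new_cores = 12, 16
old_clock, new_clock = 2.3, 2.6

score_gain = new_score / old_score
hw_gain = (new_cores / old_cores) * (new_clock / old_clock)
# Whatever cores x clock does not explain (memory speed and similar).
unexplained = score_gain / hw_gain

print(f"score gain       {score_gain:.2f}x")
print(f"cores x clock    {hw_gain:.2f}x")
print(f"remaining factor {unexplained:.2f}x")
```

The remaining factor is what the text tentatively attributes to the faster memory clock.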

The top-level scores per run are as follows; the numerical quantities summaries are appended below.

100G

Run Power Throughput Composite
Run 1 391,000.1 401,029.4 395,983.0
Run 2 388,746.2 404,189.3 396,392.6

300G

Run Power Throughput Composite
Run 1 61,988.7 384,883.7 154,461.6
Run 2 423,431.8 387,248.6 404,936.3
Run 3 417,672.0 389,719.5 403,453.7
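
The composite column in these tables is consistent with the geometric mean of the power and throughput scores, which is how the TPC-H composite metric is defined. A minimal sketch that recomputes it from the numbers above (the function and variable names are illustrative):

```python
import math

# (power, throughput, reported composite), copied from the tables above.
runs = {
    "100G run 1": (391000.1, 401029.4, 395983.0),
    "100G run 2": (388746.2, 404189.3, 396392.6),
    "300G run 1": (61988.7, 384883.7, 154461.6),
    "300G run 2": (423431.8, 387248.6, 404936.3),
    "300G run 3": (417672.0, 389719.5, 403453.7),
}

def composite(power, throughput):
    """Geometric mean of the power and throughput scores."""
    return math.sqrt(power * throughput)

for name, (power, throughput, reported) in runs.items():
    computed = composite(power, throughput)
    print(f"{name}: computed {computed:,.1f}, reported {reported:,.1f}")
```

Small differences in the last digit come from the inputs being rounded to one decimal. The cold 300G run 1 shows how a single weak power score drags the composite down even when throughput is normal.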

The interested may reproduce the results using the feature/analytics branch of the v7fasttrack git repository on GitHub as described in Part 13.

For the 300G runs, we note a much longer load time (see below), as the load is seriously I/O bound.

The first power test at 300G is a non-starter, even though it comes right after bulk load. The data is still not in working set, and getting it from disk is simply an automatic disqualification, unless maybe one had 300 separate disks. Such setups occur in TPC benchmarks, but not very often in the field. Looking at the first run, the first queries of the power test take the longest, but by the time the throughput test starts, the working set is there. By an artifact of the metric (use of geometric mean for the power test), long queries are penalized less there than in the throughput run.
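
To illustrate the remark about the geometric mean, the sketch below compares the cold Stream 0 timings of 300G Run 1 with the warm ones of Run 2, both copied from the numerical quantities summaries further down (22 queries followed by RF1 and RF2). It only contrasts the two kinds of mean; it is not an attempt to reproduce the reported scores.

```python
import math

# Stream 0 timings in seconds: Q1..Q22, then RF1 and RF2.
cold = [183.735463, 95.826361, 79.826802, 87.603164, 47.099641, 1.301704,
        2.606488, 52.667426, 41.997798, 2.727870, 21.651730, 25.704209,
        293.103984, 3.171437, 2.886688, 5.298823, 7.793403, 81.545934,
        41.648484, 4.638731, 25.003179, 0.536267, 206.980380, 2.501589]
warm = [4.197427, 1.011516, 2.535959, 0.858781, 2.857279, 1.293530,
        2.682266, 2.260502, 8.300092, 2.598145, 5.168418, 1.619399,
        11.958836, 3.191672, 3.097822, 2.497410, 4.016212, 1.603004,
        1.836489, 3.542383, 3.901876, 0.515102, 4.759612, 2.358873]

def geometric_mean(values):
    """Geometric mean computed via logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Slowdown of the cold run as seen by total elapsed time ...
arith_ratio = sum(cold) / sum(warm)
# ... and as seen by a geometric-mean-based metric.
geo_ratio = geometric_mean(cold) / geometric_mean(warm)

print(f"elapsed time ratio   {arith_ratio:.1f}x")
print(f"geometric mean ratio {geo_ratio:.1f}x")
```

For comparison, Stream 0 took 0:21:56 cold versus 0:01:16 warm, a factor of about 17, while the reported power scores of Run 2 and Run 1 differ by a factor of about 6.8 (423,431.8 / 61,988.7).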

So, we run 3 executions instead of the prescribed 2, to have 2 executions from warm state.

To do 300G well in 256 GB of RAM, one needs either to use several SSDs, or to increase compression and keep all in memory, so no secondary storage at all. In order to keep all in memory, one could have stream-compression on string columns. Stream-compressing strings (e.g., o_comment, l_comment) does not pay if one is already in memory, but if stream-compressing strings eliminates going to secondary storage, then the win is sure.

As before, all caveats apply; the results are unaudited and for information only. Therefore we do not use the official metric name.

100G Run 1

Virt-H Executive Summary

Report Date September 15, 2014
Database Scale Factor 100
Start of Database Load 09/15/2014 07:04:08
End of Database Load 09/15/2014 07:15:58
Database Load Time 0:11:50
Query Streams for Throughput Test 5
Virt-H Power 391,000.1
Virt-H Throughput 401,029.4
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 395,983.0
Measurement Interval in Throughput Test (Ts) 98.846000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/15/2014 13:13:01 09/15/2014 13:13:28 0:00:27
Stream 1 09/15/2014 13:13:29 09/15/2014 13:15:06 0:01:37
Stream 2 09/15/2014 13:13:29 09/15/2014 13:15:07 0:01:38
Stream 3 09/15/2014 13:13:29 09/15/2014 13:15:07 0:01:38
Stream 4 09/15/2014 13:13:29 09/15/2014 13:15:04 0:01:35
Stream 5 09/15/2014 13:13:29 09/15/2014 13:15:08 0:01:39
Refresh 0 09/15/2014 13:13:01 09/15/2014 13:13:03 0:00:02
09/15/2014 13:13:28 09/15/2014 13:13:29 0:00:01
Refresh 1 09/15/2014 13:14:10 09/15/2014 13:14:16 0:00:06
Refresh 2 09/15/2014 13:13:29 09/15/2014 13:13:42 0:00:13
Refresh 3 09/15/2014 13:13:42 09/15/2014 13:13:53 0:00:11
Refresh 4 09/15/2014 13:13:53 09/15/2014 13:14:02 0:00:09
Refresh 5 09/15/2014 13:14:02 09/15/2014 13:14:10 0:00:08

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 1.442477 0.304513 0.720263 0.351285 0.979414 0.479455 0.865992 0.875236
Stream 1 3.938133 0.920533 3.738724 2.769707 3.209728 1.339146 2.759384 3.626868
Stream 2 4.104738 0.952245 4.719658 0.865586 2.139267 0.850909 2.044402 2.600373
Stream 3 3.692119 1.024876 3.430172 1.579846 4.097845 1.859468 2.312921 6.238070
Stream 4 5.419537 0.531571 2.116176 1.256836 4.787617 2.117995 3.517466 3.982180
Stream 5 5.167029 0.746720 3.157557 1.255182 3.004802 2.131963 3.648316 2.835751
Min Qi 3.692119 0.531571 2.116176 0.865586 2.139267 0.850909 2.044402 2.600373
Max Qi 5.419537 1.024876 4.719658 2.769707 4.787617 2.131963 3.648316 6.238070
Avg Qi 4.464311 0.835189 3.432457 1.545431 3.447852 1.659896 2.856498 3.856648
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 2.606044 1.117063 1.847930 0.618534 4.327600 1.110908 0.995289 0.975910
Stream 1 7.463593 4.686463 4.549733 4.168129 15.759178 5.247666 4.495030 4.075198
Stream 2 9.398552 5.170904 3.934405 1.880683 19.968787 3.767992 6.965337 3.849845
Stream 3 7.581069 4.109905 4.301159 2.123634 17.683200 5.383603 4.376887 2.854777
Stream 4 9.927887 6.913209 3.351489 2.802724 16.985827 3.925148 4.691474 4.080586
Stream 5 7.035080 3.921425 6.844778 2.899238 14.839509 4.986742 6.629664 4.089547
Min Qi 7.035080 3.921425 3.351489 1.880683 14.839509 3.767992 4.376887 2.854777
Max Qi 9.927887 6.913209 6.844778 4.168129 19.968787 5.383603 6.965337 4.089547
Avg Qi 8.281236 4.960381 4.596313 2.774882 17.047300 4.662230 5.431678 3.789991
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.215956 0.745257 0.699801 1.281834 1.291110 0.518425 1.827192 1.014431
Stream 1 5.779854 2.383264 2.396793 6.130511 5.002700 1.968425 4.172437 2.427047
Stream 2 7.828176 1.833416 3.175649 4.785709 5.385834 1.403290 6.383005 6.366525
Stream 3 5.880139 1.797383 3.258024 5.601364 6.373216 1.977848 5.235542 6.385010
Stream 4 3.989621 1.252891 2.478303 4.678629 3.212176 2.740586 5.037995 3.911379
Stream 5 5.030440 2.010988 4.188428 6.221990 5.418788 2.187718 3.589915 3.517380
Min Qi 3.989621 1.252891 2.396793 4.678629 3.212176 1.403290 3.589915 2.427047
Max Qi 7.828176 2.383264 4.188428 6.221990 6.373216 2.740586 6.383005 6.385010
Avg Qi 5.701646 1.855588 3.099439 5.483641 5.078543 2.055573 4.883779 4.521468

100G Run 2

Virt-H Executive Summary

Report Date September 15, 2014
Database Scale Factor 100
Total Data Storage/Database Size 87,312M
Start of Database Load 09/15/2014 07:04:08
End of Database Load 09/15/2014 07:15:58
Database Load Time 0:11:50
Query Streams for Throughput Test 5
Virt-H Power 388,746.2
Virt-H Throughput 404,189.3
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 396,392.6
Measurement Interval in Throughput Test (Ts) 98.074000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/15/2014 13:15:11 09/15/2014 13:15:38 0:00:27
Stream 1 09/15/2014 13:15:39 09/15/2014 13:17:13 0:01:34
Stream 2 09/15/2014 13:15:39 09/15/2014 13:17:16 0:01:37
Stream 3 09/15/2014 13:15:39 09/15/2014 13:17:15 0:01:36
Stream 4 09/15/2014 13:15:39 09/15/2014 13:17:17 0:01:38
Stream 5 09/15/2014 13:15:39 09/15/2014 13:17:15 0:01:36
Refresh 0 09/15/2014 13:15:11 09/15/2014 13:15:12 0:00:01
09/15/2014 13:15:38 09/15/2014 13:15:39 0:00:01
Refresh 1 09/15/2014 13:16:13 09/15/2014 13:16:20 0:00:07
Refresh 2 09/15/2014 13:15:39 09/15/2014 13:15:47 0:00:08
Refresh 3 09/15/2014 13:15:47 09/15/2014 13:15:56 0:00:09
Refresh 4 09/15/2014 13:15:56 09/15/2014 13:16:03 0:00:07
Refresh 5 09/15/2014 13:16:03 09/15/2014 13:16:12 0:00:09

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 1.467681 0.277665 0.766102 0.365185 0.941206 0.549381 0.938998 0.803514
Stream 1 3.883169 1.488521 3.366920 1.627478 3.632321 2.065565 2.911138 2.444544
Stream 2 3.294589 1.138066 3.260775 1.899615 5.367725 1.820374 3.655119 2.186642
Stream 3 3.797641 0.995877 3.239690 2.483035 2.737690 1.505998 4.058083 4.268644
Stream 4 4.099187 0.402685 4.704959 1.469825 5.367910 2.783018 2.706164 2.551061
Stream 5 3.651273 1.598314 2.051899 1.283754 4.711897 1.519763 2.851300 2.484093
Min Qi 3.294589 0.402685 2.051899 1.283754 2.737690 1.505998 2.706164 2.186642
Max Qi 4.099187 1.598314 4.704959 2.483035 5.367910 2.783018 4.058083 4.268644
Avg Qi 3.745172 1.124693 3.324849 1.752741 4.363509 1.938944 3.236361 2.786997
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 2.734812 1.115539 1.679910 0.633239 4.391739 1.130082 1.137284 0.919646
Stream 1 9.271071 5.664855 3.377869 2.148228 16.046021 2.935643 4.897009 2.891040
Stream 2 10.272523 4.578427 4.086788 2.312762 16.295728 2.714776 6.393897 2.414951
Stream 3 7.095213 4.544636 4.073433 2.710320 18.789088 3.903873 5.471600 2.994184
Stream 4 7.567924 3.691088 3.951049 2.207944 18.189014 4.985841 6.568935 3.965322
Stream 5 8.173577 4.959777 4.736593 3.507469 17.106990 5.405699 7.357104 3.125788
Min Qi 7.095213 3.691088 3.377869 2.148228 16.046021 2.714776 4.897009 2.414951
Max Qi 10.272523 5.664855 4.736593 3.507469 18.789088 5.405699 7.357104 3.965322
Avg Qi 8.476062 4.687757 4.045146 2.577345 17.285368 3.989166 6.137709 3.078257
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.206347 0.792013 0.699476 1.349182 1.505387 0.543947 1.549135 0.824344
Stream 1 5.135036 1.873195 4.978155 5.988226 4.705365 1.211049 4.175947 3.579242
Stream 2 7.656125 2.229819 2.805272 6.629781 4.138014 1.423334 5.165700 3.197300
Stream 3 6.385983 2.086301 3.450305 3.292353 5.503905 2.302992 4.860041 3.865383
Stream 4 6.514967 2.876895 3.481100 1.629007 5.715903 2.121692 3.681208 3.347289
Stream 5 4.100205 2.400816 2.142291 4.710677 5.765320 1.616445 6.095817 3.007436
Min Qi 4.100205 1.873195 2.142291 1.629007 4.138014 1.211049 3.681208 3.007436
Max Qi 7.656125 2.876895 4.978155 6.629781 5.765320 2.302992 6.095817 3.865383
Avg Qi 5.958463 2.293405 3.371425 4.450009 5.165701 1.735102 4.795743 3.399330

300G Run 1

Virt-H Executive Summary

Report Date September 25, 2014
Database Scale Factor 300
Start of Database Load 09/25/2014 16:38:20
End of Database Load 09/25/2014 18:32:06
Database Load Time 1:53:46
Query Streams for Throughput Test 6
Virt-H Power 61,988.7
Virt-H Throughput 384,883.7
Virt-H Composite Query-per-Hour Metric (Qph@300GB) 154,461.6
Measurement Interval in Throughput Test (Ts) 370.498000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/25/2014 19:00:29 09/25/2014 19:22:25 0:21:56
Stream 1 09/25/2014 19:22:27 09/25/2014 19:28:23 0:05:56
Stream 2 09/25/2014 19:22:27 09/25/2014 19:28:23 0:05:56
Stream 3 09/25/2014 19:22:27 09/25/2014 19:28:26 0:05:59
Stream 4 09/25/2014 19:22:27 09/25/2014 19:28:13 0:05:46
Stream 5 09/25/2014 19:22:27 09/25/2014 19:28:38 0:06:11
Stream 6 09/25/2014 19:22:27 09/25/2014 19:28:38 0:06:11
Refresh 0 09/25/2014 19:00:29 09/25/2014 19:03:56 0:03:27
09/25/2014 19:22:25 09/25/2014 19:22:27 0:00:02
Refresh 1 09/25/2014 19:25:22 09/25/2014 19:25:58 0:00:36
Refresh 2 09/25/2014 19:22:27 09/25/2014 19:23:11 0:00:44
Refresh 3 09/25/2014 19:23:10 09/25/2014 19:23:40 0:00:30
Refresh 4 09/25/2014 19:23:40 09/25/2014 19:24:21 0:00:41
Refresh 5 09/25/2014 19:24:21 09/25/2014 19:24:58 0:00:37
Refresh 6 09/25/2014 19:24:59 09/25/2014 19:25:22 0:00:23

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 183.735463 95.826361 79.826802 87.603164 47.099641 1.301704 2.606488 52.667426
Stream 1 9.400003 1.983777 15.839250 3.001843 15.593335 6.067716 8.870516 11.679706
Stream 2 12.634711 3.472203 13.683075 8.057952 16.500741 5.403771 11.181661 12.393932
Stream 3 10.807287 3.793587 15.844244 3.214977 15.960600 7.099744 10.424530 21.001623
Stream 4 11.900829 3.741707 14.219904 5.616907 16.487144 14.229782 11.100193 8.769539
Stream 5 13.933423 2.916529 19.453452 5.258843 16.706269 7.948711 8.982104 17.566729
Stream 6 17.084445 0.738683 11.503079 8.324812 23.483917 20.101834 9.207737 10.311292
Min Qi 9.400003 0.738683 11.503079 3.001843 15.593335 5.403771 8.870516 8.769539
Max Qi 17.084445 3.793587 19.453452 8.324812 23.483917 20.101834 11.181661 21.001623
Avg Qi 12.626783 2.774414 15.090501 5.579222 17.455334 10.141926 9.961123 13.620470
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 41.997798 2.727870 21.651730 25.704209 293.103984 3.171437 2.886688 5.298823
Stream 1 29.662265 22.788618 12.979253 7.121358 62.774323 22.132581 22.616793 21.625334
Stream 2 28.041750 22.481172 19.262140 5.790272 58.105179 16.809177 32.813330 12.692499
Stream 3 32.534297 15.460256 12.038047 7.012926 59.413740 18.540284 25.968635 16.716208
Stream 4 28.759993 15.123651 21.734471 6.920480 63.119744 12.848884 21.372432 11.662102
Stream 5 18.315308 21.781800 26.141212 8.230858 60.985590 22.369824 27.098660 25.283066
Stream 6 31.455961 27.078707 12.954580 11.081669 72.483462 12.376376 22.129120 11.439147
Min Qi 18.315308 15.123651 12.038047 5.790272 58.105179 12.376376 21.372432 11.439147
Max Qi 32.534297 27.078707 26.141212 11.081669 72.483462 22.369824 32.813330 25.283066
Avg Qi 28.128262 20.785701 17.518284 7.692927 62.813673 17.512854 25.333162 16.569726
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 7.793403 81.545934 41.648484 4.638731 25.003179 0.536267 206.980380 2.501589
Stream 1 27.058060 3.894254 8.664394 25.315007 11.921265 3.561859 22.936601 13.235777
Stream 2 25.718500 6.140657 8.856586 14.761290 11.870351 7.728217 13.882613 29.328859
Stream 3 15.896774 8.631035 15.742406 20.621604 13.370582 5.536313 14.677463 14.772753
Stream 4 22.458327 5.319241 11.973431 22.344017 11.534642 2.402683 24.214115 16.236299
Stream 5 13.407745 5.413278 8.800650 18.055743 17.528827 4.173171 15.927165 21.636801
Stream 6 8.069721 5.531066 13.233927 21.321389 7.622026 12.064182 11.457848 12.342336
Min Qi 8.069721 3.894254 8.664394 14.761290 7.622026 2.402683 11.457848 12.342336
Max Qi 27.058060 8.631035 15.742406 25.315007 17.528827 12.064182 24.214115 29.328859
Avg Qi 18.768188 5.821588 11.211899 20.403175 12.307949 5.911071 17.182634 17.925471

300G Run 2

Virt-H Executive Summary

Report Date September 25, 2014
Database Scale Factor 300
Start of Database Load 09/25/2014 16:38:20
End of Database Load 09/25/2014 18:32:06
Database Load Time 1:53:46
Query Streams for Throughput Test 6
Virt-H Power 423,431.8
Virt-H Throughput 387,248.6
Virt-H Composite Query-per-Hour Metric (Qph@300GB) 404,936.3
Measurement Interval in Throughput Test (Ts) 368.236000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/25/2014 19:28:42 09/25/2014 19:29:58 0:01:16
Stream 1 09/25/2014 19:30:00 09/25/2014 19:36:04 0:06:04
Stream 2 09/25/2014 19:30:00 09/25/2014 19:36:00 0:06:00
Stream 3 09/25/2014 19:30:00 09/25/2014 19:36:06 0:06:06
Stream 4 09/25/2014 19:30:00 09/25/2014 19:36:07 0:06:07
Stream 5 09/25/2014 19:30:00 09/25/2014 19:35:53 0:05:53
Stream 6 09/25/2014 19:30:00 09/25/2014 19:36:08 0:06:08
Refresh 0 09/25/2014 19:28:41 09/25/2014 19:28:46 0:00:05
09/25/2014 19:29:58 09/25/2014 19:30:00 0:00:02
Refresh 1 09/25/2014 19:32:23 09/25/2014 19:32:55 0:00:32
Refresh 2 09/25/2014 19:30:00 09/25/2014 19:30:31 0:00:31
Refresh 3 09/25/2014 19:30:31 09/25/2014 19:31:00 0:00:29
Refresh 4 09/25/2014 19:31:01 09/25/2014 19:31:23 0:00:22
Refresh 5 09/25/2014 19:31:23 09/25/2014 19:31:54 0:00:31
Refresh 6 09/25/2014 19:31:55 09/25/2014 19:32:23 0:00:28

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 4.197427 1.011516 2.535959 0.858781 2.857279 1.293530 2.682266 2.260502
Stream 1 15.467757 3.517499 13.820864 4.157259 13.141556 10.902710 16.899687 8.986535
Stream 2 15.639991 6.026485 13.521624 3.918031 17.336458 1.975310 9.718194 15.165247
Stream 3 14.891929 4.481383 15.322621 5.272911 15.266543 6.771253 13.430646 20.171084
Stream 4 14.560526 2.464157 11.567112 5.526629 20.531540 5.225971 16.288606 17.209475
Stream 5 10.390577 3.549165 9.598328 8.783847 17.351211 6.308214 12.606512 13.035716
Stream 6 16.275922 4.086475 14.109963 4.385887 10.174709 6.703266 8.936217 16.798526
Min Qi 10.390577 2.464157 9.598328 3.918031 10.174709 1.975310 8.936217 8.986535
Max Qi 16.275922 6.026485 15.322621 8.783847 20.531540 10.902710 16.899687 20.171084
Avg Qi 14.537784 4.020861 12.990085 5.340761 15.633670 6.314454 12.979977 15.227764
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 8.300092 2.598145 5.168418 1.619399 11.958836 3.191672 3.097822 2.497410
Stream 1 26.412829 17.354745 12.942454 8.169447 58.600101 15.227942 32.985324 13.914978
Stream 2 34.523245 17.635531 15.193748 8.435375 62.442800 16.276300 26.533303 12.414575
Stream 3 25.334301 18.595422 11.663933 10.029387 63.664992 20.378320 24.760768 15.710589
Stream 4 36.971957 15.645673 14.672851 13.196301 58.214728 17.375053 26.581101 11.624989
Stream 5 30.891797 12.993365 14.089049 10.515091 65.232712 20.807026 26.920526 11.362095
Stream 6 38.143281 21.106772 15.152299 18.845766 66.240343 12.295624 22.510610 18.081103
Min Qi 25.334301 12.993365 11.663933 8.169447 58.214728 12.295624 22.510610 11.362095
Max Qi 38.143281 21.106772 15.193748 18.845766 66.240343 20.807026 32.985324 18.081103
Avg Qi 32.046235 17.221918 13.952389 11.531894 62.399279 17.060044 26.715272 13.851388
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 4.016212 1.603004 1.836489 3.542383 3.901876 0.515102 4.759612 2.358873
Stream 1 22.162387 10.067834 15.772705 22.091355 12.974776 8.354196 19.342171 12.771250
Stream 2 25.647926 4.263008 11.590737 19.179326 17.899770 4.137031 15.720245 14.719776
Stream 3 14.511279 7.484608 20.735250 13.041037 17.139046 6.014141 16.234122 13.454647
Stream 4 19.297494 10.110707 10.907458 19.649066 15.206251 3.423503 11.268082 11.852223
Stream 5 17.445165 5.582309 15.266324 19.788382 14.245770 2.810949 16.601461 14.019717
Stream 6 25.115339 6.896503 11.661563 21.900028 5.520025 3.093050 15.436258 13.353446
Min Qi 14.511279 4.263008 10.907458 13.041037 5.520025 2.810949 11.268082 11.852223
Max Qi 25.647926 10.110707 20.735250 22.091355 17.899770 8.354196 19.342171 14.719776
Avg Qi 20.696598 7.400828 14.322339 19.274866 13.830940 4.638812 15.767057 13.361843

300G Run 3

Virt-H Executive Summary

Report Date September 25, 2014
Database Scale Factor 300
Total Data Storage/Database Size 258,888M
Start of Database Load 09/25/2014 16:38:20
End of Database Load 09/25/2014 18:32:06
Database Load Time 1:53:46
Query Streams for Throughput Test 6
Virt-H Power 417,672.0
Virt-H Throughput 389,719.5
Virt-H Composite Query-per-Hour Metric (Qph@300GB) 403,453.7
Measurement Interval in Throughput Test (Ts) 365.902000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/25/2014 19:36:11 09/25/2014 19:37:29 0:01:18
Stream 1 09/25/2014 19:37:32 09/25/2014 19:43:13 0:05:41
Stream 2 09/25/2014 19:37:32 09/25/2014 19:43:31 0:05:59
Stream 3 09/25/2014 19:37:32 09/25/2014 19:43:37 0:06:05
Stream 4 09/25/2014 19:37:32 09/25/2014 19:43:33 0:06:01
Stream 5 09/25/2014 19:37:32 09/25/2014 19:43:32 0:06:00
Stream 6 09/25/2014 19:37:32 09/25/2014 19:43:37 0:06:05
Refresh 0 09/25/2014 19:36:12 09/25/2014 19:36:16 0:00:04
09/25/2014 19:37:29 09/25/2014 19:37:31 0:00:02
Refresh 1 09/25/2014 19:40:02 09/25/2014 19:40:33 0:00:31
Refresh 2 09/25/2014 19:37:31 09/25/2014 19:38:01 0:00:30
Refresh 3 09/25/2014 19:38:01 09/25/2014 19:38:30 0:00:29
Refresh 4 09/25/2014 19:38:30 09/25/2014 19:38:58 0:00:28
Refresh 5 09/25/2014 19:38:58 09/25/2014 19:39:27 0:00:29
Refresh 6 09/25/2014 19:39:27 09/25/2014 19:40:01 0:00:34

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 4.305006 1.083442 2.502758 0.845763 2.840824 1.346166 2.659511 2.233550
Stream 1 11.513360 3.732513 14.530428 3.819517 14.821291 7.561547 10.435082 8.984230
Stream 2 13.486433 3.373689 9.620363 3.914320 16.857542 5.837487 10.695443 17.901191
Stream 3 11.015942 1.780220 4.830412 9.073543 15.587709 9.661989 12.374931 15.262485
Stream 4 13.600461 0.820899 12.254226 7.799415 19.860761 13.145017 14.404345 11.807583
Stream 5 13.358000 3.885118 11.099935 4.845043 18.286721 6.424272 9.735255 15.041608
Stream 6 13.588873 3.789631 13.503399 5.130389 13.104065 3.517076 14.929079 19.831639
Min Qi 11.015942 0.820899 4.830412 3.819517 13.104065 3.517076 9.735255 8.984230
Max Qi 13.600461 3.885118 14.530428 9.073543 19.860761 13.145017 14.929079 19.831639
Avg Qi 12.760511 2.897012 10.973127 5.763705 16.419681 7.691231 12.095689 14.804789
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 8.553183 3.215484 4.652364 1.620089 11.936052 2.916132 3.219969 2.374276
Stream 1 29.441108 20.348266 9.994556 14.965432 60.537168 13.302875 30.159402 10.277570
Stream 2 41.799347 18.197400 16.773638 6.510347 67.461446 20.362328 0.109929 9.908769
Stream 3 24.306937 20.555376 17.140758 16.715188 61.724168 22.469230 27.967206 13.434167
Stream 4 34.820796 11.795664 18.015120 7.176057 63.134711 11.427374 23.959842 16.759246
Stream 5 23.139366 12.655317 13.152401 7.258740 64.273225 22.854106 28.803059 12.832364
Stream 6 27.955059 24.633526 11.046285 5.995041 74.965966 15.636579 22.803890 13.221303
Min Qi 23.139366 11.795664 9.994556 5.995041 60.537168 11.427374 0.109929 9.908769
Max Qi 41.799347 24.633526 18.015120 16.715188 74.965966 22.854106 30.159402 16.759246
Avg Qi 30.243769 18.030925 14.353793 9.770134 65.349447 17.675415 22.300555 12.738903
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 4.298092 1.702071 1.894548 4.118591 3.922889 0.491145 4.519734 2.347913
Stream 1 16.432222 6.908918 17.749058 18.756674 11.148628 5.464975 18.300673 12.972871
Stream 2 20.588544 4.387662 14.527229 23.844364 15.500462 15.543458 13.666574 15.240662
Stream 3 14.008049 6.222633 12.833421 22.811602 16.013232 9.449069 16.486111 12.974515
Stream 4 16.964699 8.106044 11.207675 22.483826 17.354675 4.641183 14.583941 13.679087
Stream 5 25.243144 7.359437 16.986615 19.855391 17.183725 5.750937 14.759597 13.052316
Stream 6 12.986721 10.160993 17.496662 19.267026 17.300224 4.955930 19.267721 15.421241
Min Qi 12.986721 4.387662 11.207675 18.756674 11.148628 4.641183 13.666574 12.972871
Max Qi 25.243144 10.160993 17.749058 23.844364 17.354675 15.543458 19.267721 15.421241
Avg Qi 17.703896 7.190948 15.133443 21.169814 15.750158 7.634259 16.177436 13.890115

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09/26/2014 17:07 GMT Modified: 10/06/2014 13:58 GMT
In Hoc Signo Vinces (part 18 of n): Cluster Dynamics [ Orri Erling ]

This article is about how scale-out differs from single-server. It shows the large effects of parameters whose very existence most would not anticipate, along with some low-level metrics for assessing them. The moral of the story is that this is the stuff that makes the difference between merely surviving scale-out and winning with it. The developer and DBA would not normally know about this; thus these things fall into the category of adaptive self-configuration expected from the DBMS. But since this series is about what makes performance, I will discuss the dynamics as they are and how to play them.

We take the prototypical cross partition join in Q13: Make a hash table of all customers, partitioned by c_custkey. This is independently done with full parallelism in each partition. Scan the orders, get the customer (in a different partition), and flag the customers that had at least one order. Then, to get the customers with no orders, return the customers that were not flagged in the previous pass.
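
The join described above can be sketched in a few lines of single-process Python. This is a toy model under stated assumptions, not the Virtuoso implementation: the modulo partitioning function, the partition count, and all names are hypothetical, and the "messages" between partitions are plain lists.

```python
from collections import defaultdict

N_PARTITIONS = 4  # hypothetical; the cluster in the article has 48 slices

def partition_of(custkey):
    # Hypothetical partitioning function: plain modulo on the key.
    return custkey % N_PARTITIONS

def customers_without_orders(customers, orders_by_slice):
    # Build phase: one hash table per partition, c_custkey -> "has order" flag.
    tables = [dict() for _ in range(N_PARTITIONS)]
    for c_custkey in customers:
        tables[partition_of(c_custkey)][c_custkey] = False

    # Probe phase: each scanning slice groups its o_custkeys by target
    # partition (the messages), and each partition flags the customers it owns.
    for slice_orders in orders_by_slice:
        outbox = defaultdict(list)
        for o_custkey in slice_orders:
            outbox[partition_of(o_custkey)].append(o_custkey)
        for part, keys in outbox.items():
            table = tables[part]
            for key in keys:
                if key in table:
                    table[key] = True

    # Final pass: customers that were never flagged had no orders.
    return sorted(key for table in tables
                  for key, flagged in table.items() if not flagged)

customers = range(1, 11)
orders_by_slice = [[1, 2, 2, 5], [7, 7, 9], [10, 1]]
print(customers_without_orders(customers, orders_by_slice))
```

In the real system the outbox contents travel between processes, which is where the batch and vector size settings discussed below come into play.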

The single-server times in part 12 were 7.8 s and 6.0 s with a single user. We consider the better of the two. The difference is due to allocating memory on the first go; on the second go the memory is already in reserve.

With default settings, we get 4595 ms (milliseconds), with per-node resource utilization at:


Cluster 4 nodes, 4 s. 112405 m/s 742602 KB/s  2749% cpu 0%  read 4% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 27867 m/s 185654 KB/s  733% cpu 0%  read 4% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 28149 m/s 185372 KB/s  672% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 28220 m/s 185621 KB/s  675% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 28150 m/s 185837 KB/s  667% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs

 

The top line is the summary; the lines below are per-process. Here m/s is messages per second, KB/s is interconnect traffic per second, and clw % is idle time spent waiting for a reply from another process. The cluster is set up with 4 processes across 2 machines, each machine having 2 NUMA nodes. Each process has affinity to one NUMA node, so it uses local memory only. The time is reasonable in light of the overall CPU of about 2700% (2749% in the summary line). The maximum would be 4800%, with all threads of all cores busy all the time.

The catch here is that we do not have a steady half-platform utilization all the time, but full platform peaks followed by synchronization barriers with very low utilization. So, we set the batch size differently:


cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 50000000)');

 

This means that we set, on each process, the cl_dfg_batch_bytes to 50M from a default of 10M. The effect is that each scan of orders, one thread per slice, 48 slices total, will produce 50MB worth of o_custkeys to be sent to the other partition for getting the customer. After each 50M, the thread stops and will produce the next batch when all are done and a global continue message is sent by the coordinator.
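
As a rough illustration of why the batch size matters, the number of stop-and-continue rounds a scanning slice goes through is the data it produces divided by the batch size, rounded up. The per-slice data volume below is a made-up figure, not a measurement from the article.

```python
import math

def sync_rounds(bytes_per_slice, batch_bytes):
    """Number of batches one scanning slice produces before it is done."""
    return math.ceil(bytes_per_slice / batch_bytes)

# Hypothetical volume of o_custkey data produced by one slice.
bytes_per_slice = 200_000_000

for batch_bytes in (10_000_000, 50_000_000):
    rounds = sync_rounds(bytes_per_slice, batch_bytes)
    print(f"batch {batch_bytes:>11,} bytes -> {rounds} rounds")
```

Fewer rounds means fewer points at which every thread waits for the coordinator's global continue message.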

The time is now 3173 ms with:


Cluster 4 nodes, 3 s. 158220 m/s 1054944 KB/s  3676% cpu 0%  read 1% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 39594 m/s 263962 KB/s  947% cpu 0%  read 1% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 39531 m/s 263476 KB/s  894% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 39523 m/s 263684 KB/s  933% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 39535 m/s 263586 KB/s  900% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs

 

The platform utilization is better as we see. The throughput is nearly double that of the single-server, which is pretty good for a communication-heavy query.
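
The utilization and speedup figures can be derived from the two status summaries. The sketch assumes the single-server baseline is the better part 12 time of 6.0, taken to be in seconds; names are illustrative.

```python
MAX_CPU_PCT = 4800        # all threads of all cores busy, per the article
SINGLE_SERVER_MS = 6000   # assumption: the 6.0 quoted from part 12 is seconds

runs = {
    "default batch (10M)": {"ms": 4595, "cpu_pct": 2749},
    "50M batch": {"ms": 3173, "cpu_pct": 3676},
}

summary = {}
for name, run in runs.items():
    utilization = run["cpu_pct"] / MAX_CPU_PCT
    speedup = SINGLE_SERVER_MS / run["ms"]
    summary[name] = (utilization, speedup)
    print(f"{name}: {utilization:.0%} of platform, {speedup:.2f}x single-server")
```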

This was done with a vector size of 10K. In other words, each partition gets 10K o_custkeys and splits these 48 ways to go to every recipient. 1/4 are in the same process, 1/4 in a different process on the same machine, and 2/4 on a different machine. The recipient gets messages with an average of 208 o_custkey values, puts them back together in batches of 10K, and passes these to the hash join with customer.
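
The message size arithmetic can be written out as follows. It assumes keys are spread evenly over the 48 slices, and the constant and function names are illustrative.

```python
SLICES = 48     # recipients of each vector, from the article
PROCESSES = 4
MACHINES = 2

def avg_values_per_message(vector_size, recipients=SLICES):
    # Assumes the keys of a vector are spread evenly over all recipients.
    return vector_size / recipients

for vector_size in (10_000, 100_000, 1_000_000):
    per_message = round(avg_values_per_message(vector_size))
    print(f"vector {vector_size:>9,}: about {per_message:,} values per message")

# Destination mix for one sending process.
same_process = 1 / PROCESSES
same_machine_other_process = (PROCESSES // MACHINES - 1) / PROCESSES
other_machine = 1 - same_process - same_machine_other_process
print(same_process, same_machine_other_process, other_machine)
```

Of the traffic that leaves the sending process, one third therefore stays on the same machine and two thirds crosses the network.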

We try different vector sizes, such as 100K:


cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');

 

There are two metrics of interest here: the write block time and the scheduling overhead. The write block time is a counter in microseconds that increases whenever a thread must wait before it can write to a connection. The scheduling overhead is the cumulative clocks spent by threads waiting for a critical section that deals with dispatching messages to consumer threads. Long messages cause blocking on write; short messages cause frequent scheduling decisions.


SELECT cl_sys_stat ('local_cll_clk', clr=>1), 
       cl_sys_stat ('write_block_usec', clr=>1)
;

 

cl_sys_stat gets the counters from all processes and returns the sum. clr=>1 means that the counter is cleared after read.

We do Q13 with vector sizes of 10K, 100K, and 1000K.

Vector size msec mtx wblock
10K 3297 10,829,910,329 0
100K 3150 1,663,238,367 59,132
1000K 3876 414,631,129 4,578,003

So, 100K seems to strike the best balance between scheduling overhead (mtx, in clocks) and blocking on write (wblock, in microseconds).
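
The trade-off in the table can be made explicit by normalizing the counters. The sketch below uses the measured values as given; the interpretation of mtx as clocks and wblock as microseconds follows the description of the two counters above.

```python
# (vector size, elapsed msec, mtx clocks, write block usec), from the table.
measurements = [
    ("10K", 3297, 10_829_910_329, 0),
    ("100K", 3150, 1_663_238_367, 59_132),
    ("1000K", 3876, 414_631_129, 4_578_003),
]

baseline_mtx = measurements[0][2]
for size, msec, mtx, wblock_usec in measurements:
    print(f"{size:>6}: {msec} ms, "
          f"mtx at {mtx / baseline_mtx:.2f} of the 10K value, "
          f"write block {wblock_usec / 1e6:.2f} s summed over all processes")

# The fastest setting is the one with the lowest elapsed time.
best = min(measurements, key=lambda m: m[1])
print("fastest:", best[0])
```

Going from 100K to 1000K cuts the scheduling overhead further, but the write block time grows by almost two orders of magnitude and the elapsed time gets worse.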

The times are measured after several samples with each setting. The times stabilize after a few runs, once memory blocks of the appropriate size are in reserve. Calling mmap to allocate these on the first run with each size carries a very high penalty, e.g., 60 s for the first run with the 1000K (1M) vector size. We note that blocking on write is really bad even though 1/3 of the inter-process traffic stays on the same machine and involves no network, and the other 2/3 goes over a fast network (QDR IB) with no other load. Further, the affinities are set so that the thread responsible for incoming messages is always on core. Result variability on consecutive runs is under 5%, which is similar to single-server behavior.

It would seem that a mutex, as bad as it is, is still better than a distributed cause for going off core (blocking on write). The latency for continuing a thread thus blocked is of course higher than the latency for continuing one that is waiting for a mutex.

We note that a cluster with more machines can take a longer vector size because a vector spreads out to more recipients. The key seems to be to set the message size so that blocking on write is not common. This is a possible adaptive execution feature. We have seen no particular benefit from SDP (Sockets Direct Protocol) and its zero copy. This is a TCP replacement that comes with the InfiniBand drivers.

We will next look at replication/partitioning tradeoffs for hash joins. Then we can look at full runs.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09/26/2014 17:02 GMT Modified: 10/06/2014 13:54 GMT
In Hoc Signo Vinces (part 17 of n): 100G and 300G Runs on Dual Xeon E5 2650v2 [ Orri Erling ]

This is an update presenting sample results on a newer platform for a single-server configuration. This is to verify that performance scales with the addition of cores and clock speed. Further, we note that the jump from 100G to 300G changes very little about the score. 3x larger takes approximately 3x longer, as long as things are in memory.

The platform is one node of the CWI cluster which was also used for the 500Gt RDF experiments reported on this blog. The specification is dual Xeon E5 2650v2 (8 core, 16 thread, 2.6 GHz) with 256 GB RAM. The disk setup is a RAID-0 of three 2 TB rotating disks.

For the 100G, the composite score goes from about 240K to about 395K, which is about 1.64x. The new platform has 16 vs 12 cores and a clock of 2.6 GHz as opposed to 2.3 GHz. Cores times clock makes a multiplier of 1.5. The rest of the acceleration is probably attributable to a faster memory clock. In any case, the point that a larger platform gives more speed is made.

The top-level scores per run are as follows; the numerical quantities summaries are appended below.

100G

Run Power Throughput Composite
Run 1 391,000.1 401,029.4 395,983.0
Run 2 388,746.2 404,189.3 396,392.6

300G

Run Power Throughput Composite
Run 1 61,988.7 384,883.7 154,461.6
Run 2 423,431.8 387,248.6 404,936.3
Run 3 417,672.0 389,719.5 403,453.7

The interested may reproduce the results using the feature/analytics branch of the v7fasttrack git repository on GitHub as described in Part 13.

For the 300G runs, we note a much longer load time (see below), as the load is seriously I/O bound.

The first power test at 300G is a non-starter, even though it comes right after bulk load. The data is still not in working set, and getting it from disk is simply an automatic disqualification, unless maybe one had 300 separate disks. Such setups occur in TPC benchmarks, but not very often in the field. Looking at the first run, the first queries of the power test take the longest, but by the time the throughput test starts, the working set is there. By an artifact of the metric (use of geometric mean for the power test), long queries are penalized less there than in the throughput run.

So, we run 3 executions instead of the prescribed 2, to have 2 executions from warm state.

To do 300G well in 256 GB of RAM, one needs either to use several SSDs, or to increase compression and keep all in memory, so no secondary storage at all. In order to keep all in memory, one could have stream-compression on string columns. Stream-compressing strings (e.g., o_comment, l_comment) does not pay if one is already in memory, but if stream-compressing strings eliminates going to secondary storage, then the win is sure.

As before, all caveats apply; the results are unaudited and for information only. Therefore we do not use the official metric name.

100G Run 1

Virt-H Executive Summary

Report Date September 15, 2014
Database Scale Factor 100
Start of Database Load 09/15/2014 07:04:08
End of Database Load 09/15/2014 07:15:58
Database Load Time 0:11:50
Query Streams for Throughput Test 5
Virt-H Power 391,000.1
Virt-H Throughput 401,029.4
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 395,983.0
Measurement Interval in Throughput Test (Ts) 98.846000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/15/2014 13:13:01 09/15/2014 13:13:28 0:00:27
Stream 1 09/15/2014 13:13:29 09/15/2014 13:15:06 0:01:37
Stream 2 09/15/2014 13:13:29 09/15/2014 13:15:07 0:01:38
Stream 3 09/15/2014 13:13:29 09/15/2014 13:15:07 0:01:38
Stream 4 09/15/2014 13:13:29 09/15/2014 13:15:04 0:01:35
Stream 5 09/15/2014 13:13:29 09/15/2014 13:15:08 0:01:39
Refresh 0 09/15/2014 13:13:01 09/15/2014 13:13:03 0:00:02
09/15/2014 13:13:28 09/15/2014 13:13:29 0:00:01
Refresh 1 09/15/2014 13:14:10 09/15/2014 13:14:16 0:00:06
Refresh 2 09/15/2014 13:13:29 09/15/2014 13:13:42 0:00:13
Refresh 3 09/15/2014 13:13:42 09/15/2014 13:13:53 0:00:11
Refresh 4 09/15/2014 13:13:53 09/15/2014 13:14:02 0:00:09
Refresh 5 09/15/2014 13:14:02 09/15/2014 13:14:10 0:00:08

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 1.442477 0.304513 0.720263 0.351285 0.979414 0.479455 0.865992 0.875236
Stream 1 3.938133 0.920533 3.738724 2.769707 3.209728 1.339146 2.759384 3.626868
Stream 2 4.104738 0.952245 4.719658 0.865586 2.139267 0.850909 2.044402 2.600373
Stream 3 3.692119 1.024876 3.430172 1.579846 4.097845 1.859468 2.312921 6.238070
Stream 4 5.419537 0.531571 2.116176 1.256836 4.787617 2.117995 3.517466 3.982180
Stream 5 5.167029 0.746720 3.157557 1.255182 3.004802 2.131963 3.648316 2.835751
Min Qi 3.692119 0.531571 2.116176 0.865586 2.139267 0.850909 2.044402 2.600373
Max Qi 5.419537 1.024876 4.719658 2.769707 4.787617 2.131963 3.648316 6.238070
Avg Qi 4.464311 0.835189 3.432457 1.545431 3.447852 1.659896 2.856498 3.856648
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 2.606044 1.117063 1.847930 0.618534 4.327600 1.110908 0.995289 0.975910
Stream 1 7.463593 4.686463 4.549733 4.168129 15.759178 5.247666 4.495030 4.075198
Stream 2 9.398552 5.170904 3.934405 1.880683 19.968787 3.767992 6.965337 3.849845
Stream 3 7.581069 4.109905 4.301159 2.123634 17.683200 5.383603 4.376887 2.854777
Stream 4 9.927887 6.913209 3.351489 2.802724 16.985827 3.925148 4.691474 4.080586
Stream 5 7.035080 3.921425 6.844778 2.899238 14.839509 4.986742 6.629664 4.089547
Min Qi 7.035080 3.921425 3.351489 1.880683 14.839509 3.767992 4.376887 2.854777
Max Qi 9.927887 6.913209 6.844778 4.168129 19.968787 5.383603 6.965337 4.089547
Avg Qi 8.281236 4.960381 4.596313 2.774882 17.047300 4.662230 5.431678 3.789991
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.215956 0.745257 0.699801 1.281834 1.291110 0.518425 1.827192 1.014431
Stream 1 5.779854 2.383264 2.396793 6.130511 5.002700 1.968425 4.172437 2.427047
Stream 2 7.828176 1.833416 3.175649 4.785709 5.385834 1.403290 6.383005 6.366525
Stream 3 5.880139 1.797383 3.258024 5.601364 6.373216 1.977848 5.235542 6.385010
Stream 4 3.989621 1.252891 2.478303 4.678629 3.212176 2.740586 5.037995 3.911379
Stream 5 5.030440 2.010988 4.188428 6.221990 5.418788 2.187718 3.589915 3.517380
Min Qi 3.989621 1.252891 2.396793 4.678629 3.212176 1.403290 3.589915 2.427047
Max Qi 7.828176 2.383264 4.188428 6.221990 6.373216 2.740586 6.383005 6.385010
Avg Qi 5.701646 1.855588 3.099439 5.483641 5.078543 2.055573 4.883779 4.521468

100G Run 2

Virt-H Executive Summary

Report Date September 15, 2014
Database Scale Factor 100
Total Data Storage/Database Size 87,312M
Start of Database Load 09/15/2014 07:04:08
End of Database Load 09/15/2014 07:15:58
Database Load Time 0:11:50
Query Streams for Throughput Test 5
Virt-H Power 388,746.2
Virt-H Throughput 404,189.3
Virt-H Composite Query-per-Hour Metric (Qph@100GB) 396,392.6
Measurement Interval in Throughput Test (Ts) 98.074000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/15/2014 13:15:11 09/15/2014 13:15:38 0:00:27
Stream 1 09/15/2014 13:15:39 09/15/2014 13:17:13 0:01:34
Stream 2 09/15/2014 13:15:39 09/15/2014 13:17:16 0:01:37
Stream 3 09/15/2014 13:15:39 09/15/2014 13:17:15 0:01:36
Stream 4 09/15/2014 13:15:39 09/15/2014 13:17:17 0:01:38
Stream 5 09/15/2014 13:15:39 09/15/2014 13:17:15 0:01:36
Refresh 0 09/15/2014 13:15:11 09/15/2014 13:15:12 0:00:01
09/15/2014 13:15:38 09/15/2014 13:15:39 0:00:01
Refresh 1 09/15/2014 13:16:13 09/15/2014 13:16:20 0:00:07
Refresh 2 09/15/2014 13:15:39 09/15/2014 13:15:47 0:00:08
Refresh 3 09/15/2014 13:15:47 09/15/2014 13:15:56 0:00:09
Refresh 4 09/15/2014 13:15:56 09/15/2014 13:16:03 0:00:07
Refresh 5 09/15/2014 13:16:03 09/15/2014 13:16:12 0:00:09

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 1.467681 0.277665 0.766102 0.365185 0.941206 0.549381 0.938998 0.803514
Stream 1 3.883169 1.488521 3.366920 1.627478 3.632321 2.065565 2.911138 2.444544
Stream 2 3.294589 1.138066 3.260775 1.899615 5.367725 1.820374 3.655119 2.186642
Stream 3 3.797641 0.995877 3.239690 2.483035 2.737690 1.505998 4.058083 4.268644
Stream 4 4.099187 0.402685 4.704959 1.469825 5.367910 2.783018 2.706164 2.551061
Stream 5 3.651273 1.598314 2.051899 1.283754 4.711897 1.519763 2.851300 2.484093
Min Qi 3.294589 0.402685 2.051899 1.283754 2.737690 1.505998 2.706164 2.186642
Max Qi 4.099187 1.598314 4.704959 2.483035 5.367910 2.783018 4.058083 4.268644
Avg Qi 3.745172 1.124693 3.324849 1.752741 4.363509 1.938944 3.236361 2.786997
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 2.734812 1.115539 1.679910 0.633239 4.391739 1.130082 1.137284 0.919646
Stream 1 9.271071 5.664855 3.377869 2.148228 16.046021 2.935643 4.897009 2.891040
Stream 2 10.272523 4.578427 4.086788 2.312762 16.295728 2.714776 6.393897 2.414951
Stream 3 7.095213 4.544636 4.073433 2.710320 18.789088 3.903873 5.471600 2.994184
Stream 4 7.567924 3.691088 3.951049 2.207944 18.189014 4.985841 6.568935 3.965322
Stream 5 8.173577 4.959777 4.736593 3.507469 17.106990 5.405699 7.357104 3.125788
Min Qi 7.095213 3.691088 3.377869 2.148228 16.046021 2.714776 4.897009 2.414951
Max Qi 10.272523 5.664855 4.736593 3.507469 18.789088 5.405699 7.357104 3.965322
Avg Qi 8.476062 4.687757 4.045146 2.577345 17.285368 3.989166 6.137709 3.078257
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 1.206347 0.792013 0.699476 1.349182 1.505387 0.543947 1.549135 0.824344
Stream 1 5.135036 1.873195 4.978155 5.988226 4.705365 1.211049 4.175947 3.579242
Stream 2 7.656125 2.229819 2.805272 6.629781 4.138014 1.423334 5.165700 3.197300
Stream 3 6.385983 2.086301 3.450305 3.292353 5.503905 2.302992 4.860041 3.865383
Stream 4 6.514967 2.876895 3.481100 1.629007 5.715903 2.121692 3.681208 3.347289
Stream 5 4.100205 2.400816 2.142291 4.710677 5.765320 1.616445 6.095817 3.007436
Min Qi 4.100205 1.873195 2.142291 1.629007 4.138014 1.211049 3.681208 3.007436
Max Qi 7.656125 2.876895 4.978155 6.629781 5.765320 2.302992 6.095817 3.865383
Avg Qi 5.958463 2.293405 3.371425 4.450009 5.165701 1.735102 4.795743 3.399330

300G Run 1

Virt-H Executive Summary

Report Date September 25, 2014
Database Scale Factor 300
Start of Database Load 09/25/2014 16:38:20
End of Database Load 09/25/2014 18:32:06
Database Load Time 1:53:46
Query Streams for Throughput Test 6
Virt-H Power 61,988.7
Virt-H Throughput 384,883.7
Virt-H Composite Query-per-Hour Metric (Qph@300GB) 154,461.6
Measurement Interval in Throughput Test (Ts) 370.498000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/25/2014 19:00:29 09/25/2014 19:22:25 0:21:56
Stream 1 09/25/2014 19:22:27 09/25/2014 19:28:23 0:05:56
Stream 2 09/25/2014 19:22:27 09/25/2014 19:28:23 0:05:56
Stream 3 09/25/2014 19:22:27 09/25/2014 19:28:26 0:05:59
Stream 4 09/25/2014 19:22:27 09/25/2014 19:28:13 0:05:46
Stream 5 09/25/2014 19:22:27 09/25/2014 19:28:38 0:06:11
Stream 6 09/25/2014 19:22:27 09/25/2014 19:28:38 0:06:11
Refresh 0 09/25/2014 19:00:29 09/25/2014 19:03:56 0:03:27
09/25/2014 19:22:25 09/25/2014 19:22:27 0:00:02
Refresh 1 09/25/2014 19:25:22 09/25/2014 19:25:58 0:00:36
Refresh 2 09/25/2014 19:22:27 09/25/2014 19:23:11 0:00:44
Refresh 3 09/25/2014 19:23:10 09/25/2014 19:23:40 0:00:30
Refresh 4 09/25/2014 19:23:40 09/25/2014 19:24:21 0:00:41
Refresh 5 09/25/2014 19:24:21 09/25/2014 19:24:58 0:00:37
Refresh 6 09/25/2014 19:24:59 09/25/2014 19:25:22 0:00:23

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 183.735463 95.826361 79.826802 87.603164 47.099641 1.301704 2.606488 52.667426
Stream 1 9.400003 1.983777 15.839250 3.001843 15.593335 6.067716 8.870516 11.679706
Stream 2 12.634711 3.472203 13.683075 8.057952 16.500741 5.403771 11.181661 12.393932
Stream 3 10.807287 3.793587 15.844244 3.214977 15.960600 7.099744 10.424530 21.001623
Stream 4 11.900829 3.741707 14.219904 5.616907 16.487144 14.229782 11.100193 8.769539
Stream 5 13.933423 2.916529 19.453452 5.258843 16.706269 7.948711 8.982104 17.566729
Stream 6 17.084445 0.738683 11.503079 8.324812 23.483917 20.101834 9.207737 10.311292
Min Qi 9.400003 0.738683 11.503079 3.001843 15.593335 5.403771 8.870516 8.769539
Max Qi 17.084445 3.793587 19.453452 8.324812 23.483917 20.101834 11.181661 21.001623
Avg Qi 12.626783 2.774414 15.090501 5.579222 17.455334 10.141926 9.961123 13.620470
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 41.997798 2.727870 21.651730 25.704209 293.103984 3.171437 2.886688 5.298823
Stream 1 29.662265 22.788618 12.979253 7.121358 62.774323 22.132581 22.616793 21.625334
Stream 2 28.041750 22.481172 19.262140 5.790272 58.105179 16.809177 32.813330 12.692499
Stream 3 32.534297 15.460256 12.038047 7.012926 59.413740 18.540284 25.968635 16.716208
Stream 4 28.759993 15.123651 21.734471 6.920480 63.119744 12.848884 21.372432 11.662102
Stream 5 18.315308 21.781800 26.141212 8.230858 60.985590 22.369824 27.098660 25.283066
Stream 6 31.455961 27.078707 12.954580 11.081669 72.483462 12.376376 22.129120 11.439147
Min Qi 18.315308 15.123651 12.038047 5.790272 58.105179 12.376376 21.372432 11.439147
Max Qi 32.534297 27.078707 26.141212 11.081669 72.483462 22.369824 32.813330 25.283066
Avg Qi 28.128262 20.785701 17.518284 7.692927 62.813673 17.512854 25.333162 16.569726
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 7.793403 81.545934 41.648484 4.638731 25.003179 0.536267 206.980380 2.501589
Stream 1 27.058060 3.894254 8.664394 25.315007 11.921265 3.561859 22.936601 13.235777
Stream 2 25.718500 6.140657 8.856586 14.761290 11.870351 7.728217 13.882613 29.328859
Stream 3 15.896774 8.631035 15.742406 20.621604 13.370582 5.536313 14.677463 14.772753
Stream 4 22.458327 5.319241 11.973431 22.344017 11.534642 2.402683 24.214115 16.236299
Stream 5 13.407745 5.413278 8.800650 18.055743 17.528827 4.173171 15.927165 21.636801
Stream 6 8.069721 5.531066 13.233927 21.321389 7.622026 12.064182 11.457848 12.342336
Min Qi 8.069721 3.894254 8.664394 14.761290 7.622026 2.402683 11.457848 12.342336
Max Qi 27.058060 8.631035 15.742406 25.315007 17.528827 12.064182 24.214115 29.328859
Avg Qi 18.768188 5.821588 11.211899 20.403175 12.307949 5.911071 17.182634 17.925471

300G Run 2

Virt-H Executive Summary

Report Date September 25, 2014
Database Scale Factor 300
Start of Database Load 09/25/2014 16:38:20
End of Database Load 09/25/2014 18:32:06
Database Load Time 1:53:46
Query Streams for Throughput Test 6
Virt-H Power 423,431.8
Virt-H Throughput 387,248.6
Virt-H Composite Query-per-Hour Metric (Qph@300GB) 404,936.3
Measurement Interval in Throughput Test (Ts) 368.236000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/25/2014 19:28:42 09/25/2014 19:29:58 0:01:16
Stream 1 09/25/2014 19:30:00 09/25/2014 19:36:04 0:06:04
Stream 2 09/25/2014 19:30:00 09/25/2014 19:36:00 0:06:00
Stream 3 09/25/2014 19:30:00 09/25/2014 19:36:06 0:06:06
Stream 4 09/25/2014 19:30:00 09/25/2014 19:36:07 0:06:07
Stream 5 09/25/2014 19:30:00 09/25/2014 19:35:53 0:05:53
Stream 6 09/25/2014 19:30:00 09/25/2014 19:36:08 0:06:08
Refresh 0 09/25/2014 19:28:41 09/25/2014 19:28:46 0:00:05
09/25/2014 19:29:58 09/25/2014 19:30:00 0:00:02
Refresh 1 09/25/2014 19:32:23 09/25/2014 19:32:55 0:00:32
Refresh 2 09/25/2014 19:30:00 09/25/2014 19:30:31 0:00:31
Refresh 3 09/25/2014 19:30:31 09/25/2014 19:31:00 0:00:29
Refresh 4 09/25/2014 19:31:01 09/25/2014 19:31:23 0:00:22
Refresh 5 09/25/2014 19:31:23 09/25/2014 19:31:54 0:00:31
Refresh 6 09/25/2014 19:31:55 09/25/2014 19:32:23 0:00:28

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 4.197427 1.011516 2.535959 0.858781 2.857279 1.293530 2.682266 2.260502
Stream 1 15.467757 3.517499 13.820864 4.157259 13.141556 10.902710 16.899687 8.986535
Stream 2 15.639991 6.026485 13.521624 3.918031 17.336458 1.975310 9.718194 15.165247
Stream 3 14.891929 4.481383 15.322621 5.272911 15.266543 6.771253 13.430646 20.171084
Stream 4 14.560526 2.464157 11.567112 5.526629 20.531540 5.225971 16.288606 17.209475
Stream 5 10.390577 3.549165 9.598328 8.783847 17.351211 6.308214 12.606512 13.035716
Stream 6 16.275922 4.086475 14.109963 4.385887 10.174709 6.703266 8.936217 16.798526
Min Qi 10.390577 2.464157 9.598328 3.918031 10.174709 1.975310 8.936217 8.986535
Max Qi 16.275922 6.026485 15.322621 8.783847 20.531540 10.902710 16.899687 20.171084
Avg Qi 14.537784 4.020861 12.990085 5.340761 15.633670 6.314454 12.979977 15.227764
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 8.300092 2.598145 5.168418 1.619399 11.958836 3.191672 3.097822 2.497410
Stream 1 26.412829 17.354745 12.942454 8.169447 58.600101 15.227942 32.985324 13.914978
Stream 2 34.523245 17.635531 15.193748 8.435375 62.442800 16.276300 26.533303 12.414575
Stream 3 25.334301 18.595422 11.663933 10.029387 63.664992 20.378320 24.760768 15.710589
Stream 4 36.971957 15.645673 14.672851 13.196301 58.214728 17.375053 26.581101 11.624989
Stream 5 30.891797 12.993365 14.089049 10.515091 65.232712 20.807026 26.920526 11.362095
Stream 6 38.143281 21.106772 15.152299 18.845766 66.240343 12.295624 22.510610 18.081103
Min Qi 25.334301 12.993365 11.663933 8.169447 58.214728 12.295624 22.510610 11.362095
Max Qi 38.143281 21.106772 15.193748 18.845766 66.240343 20.807026 32.985324 18.081103
Avg Qi 32.046235 17.221918 13.952389 11.531894 62.399279 17.060044 26.715272 13.851388
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 4.016212 1.603004 1.836489 3.542383 3.901876 0.515102 4.759612 2.358873
Stream 1 22.162387 10.067834 15.772705 22.091355 12.974776 8.354196 19.342171 12.771250
Stream 2 25.647926 4.263008 11.590737 19.179326 17.899770 4.137031 15.720245 14.719776
Stream 3 14.511279 7.484608 20.735250 13.041037 17.139046 6.014141 16.234122 13.454647
Stream 4 19.297494 10.110707 10.907458 19.649066 15.206251 3.423503 11.268082 11.852223
Stream 5 17.445165 5.582309 15.266324 19.788382 14.245770 2.810949 16.601461 14.019717
Stream 6 25.115339 6.896503 11.661563 21.900028 5.520025 3.093050 15.436258 13.353446
Min Qi 14.511279 4.263008 10.907458 13.041037 5.520025 2.810949 11.268082 11.852223
Max Qi 25.647926 10.110707 20.735250 22.091355 17.899770 8.354196 19.342171 14.719776
Avg Qi 20.696598 7.400828 14.322339 19.274866 13.830940 4.638812 15.767057 13.361843

300G Run 3

Virt-H Executive Summary

Report Date September 25, 2014
Database Scale Factor 300
Total Data Storage/Database Size 258,888M
Start of Database Load 09/25/2014 16:38:20
End of Database Load 09/25/2014 18:32:06
Database Load Time 1:53:46
Query Streams for Throughput Test 6
Virt-H Power 417,672.0
Virt-H Throughput 389,719.5
Virt-H Composite Query-per-Hour Metric (Qph@300GB) 403,453.7
Measurement Interval in Throughput Test (Ts) 365.902000 seconds

Duration of stream execution:

Start Date/Time End Date/Time Duration
Stream 0 09/25/2014 19:36:11 09/25/2014 19:37:29 0:01:18
Stream 1 09/25/2014 19:37:32 09/25/2014 19:43:13 0:05:41
Stream 2 09/25/2014 19:37:32 09/25/2014 19:43:31 0:05:59
Stream 3 09/25/2014 19:37:32 09/25/2014 19:43:37 0:06:05
Stream 4 09/25/2014 19:37:32 09/25/2014 19:43:33 0:06:01
Stream 5 09/25/2014 19:37:32 09/25/2014 19:43:32 0:06:00
Stream 6 09/25/2014 19:37:32 09/25/2014 19:43:37 0:06:05
Refresh 0 09/25/2014 19:36:12 09/25/2014 19:36:16 0:00:04
09/25/2014 19:37:29 09/25/2014 19:37:31 0:00:02
Refresh 1 09/25/2014 19:40:02 09/25/2014 19:40:33 0:00:31
Refresh 2 09/25/2014 19:37:31 09/25/2014 19:38:01 0:00:30
Refresh 3 09/25/2014 19:38:01 09/25/2014 19:38:30 0:00:29
Refresh 4 09/25/2014 19:38:30 09/25/2014 19:38:58 0:00:28
Refresh 5 09/25/2014 19:38:58 09/25/2014 19:39:27 0:00:29
Refresh 6 09/25/2014 19:39:27 09/25/2014 19:40:01 0:00:34

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Stream 0 4.305006 1.083442 2.502758 0.845763 2.840824 1.346166 2.659511 2.233550
Stream 1 11.513360 3.732513 14.530428 3.819517 14.821291 7.561547 10.435082 8.984230
Stream 2 13.486433 3.373689 9.620363 3.914320 16.857542 5.837487 10.695443 17.901191
Stream 3 11.015942 1.780220 4.830412 9.073543 15.587709 9.661989 12.374931 15.262485
Stream 4 13.600461 0.820899 12.254226 7.799415 19.860761 13.145017 14.404345 11.807583
Stream 5 13.358000 3.885118 11.099935 4.845043 18.286721 6.424272 9.735255 15.041608
Stream 6 13.588873 3.789631 13.503399 5.130389 13.104065 3.517076 14.929079 19.831639
Min Qi 11.015942 0.820899 4.830412 3.819517 13.104065 3.517076 9.735255 8.984230
Max Qi 13.600461 3.885118 14.530428 9.073543 19.860761 13.145017 14.929079 19.831639
Avg Qi 12.760511 2.897012 10.973127 5.763705 16.419681 7.691231 12.095689 14.804789
Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16
Stream 0 8.553183 3.215484 4.652364 1.620089 11.936052 2.916132 3.219969 2.374276
Stream 1 29.441108 20.348266 9.994556 14.965432 60.537168 13.302875 30.159402 10.277570
Stream 2 41.799347 18.197400 16.773638 6.510347 67.461446 20.362328 0.109929 9.908769
Stream 3 24.306937 20.555376 17.140758 16.715188 61.724168 22.469230 27.967206 13.434167
Stream 4 34.820796 11.795664 18.015120 7.176057 63.134711 11.427374 23.959842 16.759246
Stream 5 23.139366 12.655317 13.152401 7.258740 64.273225 22.854106 28.803059 12.832364
Stream 6 27.955059 24.633526 11.046285 5.995041 74.965966 15.636579 22.803890 13.221303
Min Qi 23.139366 11.795664 9.994556 5.995041 60.537168 11.427374 0.109929 9.908769
Max Qi 41.799347 24.633526 18.015120 16.715188 74.965966 22.854106 30.159402 16.759246
Avg Qi 30.243769 18.030925 14.353793 9.770134 65.349447 17.675415 22.300555 12.738903
Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2
Stream 0 4.298092 1.702071 1.894548 4.118591 3.922889 0.491145 4.519734 2.347913
Stream 1 16.432222 6.908918 17.749058 18.756674 11.148628 5.464975 18.300673 12.972871
Stream 2 20.588544 4.387662 14.527229 23.844364 15.500462 15.543458 13.666574 15.240662
Stream 3 14.008049 6.222633 12.833421 22.811602 16.013232 9.449069 16.486111 12.974515
Stream 4 16.964699 8.106044 11.207675 22.483826 17.354675 4.641183 14.583941 13.679087
Stream 5 25.243144 7.359437 16.986615 19.855391 17.183725 5.750937 14.759597 13.052316
Stream 6 12.986721 10.160993 17.496662 19.267026 17.300224 4.955930 19.267721 15.421241
Min Qi 12.986721 4.387662 11.207675 18.756674 11.148628 4.641183 13.666574 12.972871
Max Qi 25.243144 10.160993 17.749058 23.844364 17.354675 15.543458 19.267721 15.421241
Avg Qi 17.703896 7.190948 15.133443 21.169814 15.750158 7.634259 16.177436 13.890115

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09/26/2014 17:02 GMT Modified: 10/06/2014 13:55 GMT
In Hoc Signo Vinces (part 16 of n): Introduction to Scale-Out [ Virtuso Data Space Bot ]

So far, we have analyzed TPC-H in a single-server, memory-only setting. We will now move to larger data and cluster implementations. In principle, TPC-H parallelizes well, so we should expect near-linear scalability; i.e., twice the gear runs twice as fast, or close enough.

In practice, things are not quite so simple. Larger data, particularly a different data-to-memory ratio, and the fact of having no shared memory, all play a role. There is also a network, so partitioned operations, which also existed in the single-server case, now have to send messages across machines, not across threads. For data loading and refreshes, there is generally no shared file system, so data distribution and parallelism have to be considered.

As an initial pass, we look at 100G and 1000G scales on the same test system as before. This is two machines, each with dual Xeon E5-2630, 192 GB RAM, 2 x 512 GB SSD, and QDR InfiniBand. We will also try other platforms, but if nothing else is said, this is the test system.

As of this writing, there is a working implementation, but it is not guaranteed to be optimal as yet. We will adjust it as we go through the workload. One outcome of the experiment will be a precise determination of the data-volume-to-RAM ratio that still gives good performance.

A priori, we know of the following things that complicate life with clusters:

  • Distributed memory — The working set must be in memory for a run to have a competitive score. A cluster can have a lot of memory, and the data is such that it partitions very evenly, so at first this does not appear to be a problem. The difficulty comes with query memory: Suppose a hash table would be 1/64th of the working set. On a single server it is no problem just building the hash table. On a scale-out system where each machine has 1/16th of the total RAM, the hash table would be 1/4 of each node's share of the working set if replicated on each node, which will not fit, especially if there are many such hash tables at the same time. Two main approaches exist: The hash table can be partitioned, but this will force the probe to go cross-partition, which takes time. The other possibility is to build the hash table many times, each time with a fraction of the data, and to run the probe side many times. Since hash tables often have Bloom filters, it is sometimes possible to replicate the Bloom filter and partition the hash table. One has also heard of hash tables that go to secondary storage, but should this happen, the race is already lost; so, we do not go there.

    We must evaluate different combinations of these techniques and have a cost model that accurately predicts the performance of each variant. Adding realism to the cost model is always safe, but it is fairly difficult to do.

  • NUMA — Most servers are NUMA (non-uniform memory architecture), where each CPU socket has its own local memory. For single-server cases, we use all the memory for the process. Some implementations have special logic for memory affinity between threads. With scale-out there is the choice of having a server process per NUMA node or per physical machine. With a process per NUMA node, all memory accesses are guaranteed to be local. This is a tradeoff to be evaluated.

  • Network and Scheduling — Execution on a cluster is always vectored, for the simple reason that sending single-tuple messages is infeasible in terms of performance. With an otherwise vectored architecture, the message batching required on a cluster comes naturally. However, the larger the cluster, the more partitions there are, which rapidly leads to shorter messages. Increasing the vector size is possible and makes messages longer, but an indefinite increase in vector size has drawbacks for cache locality and takes memory. To run well, each thread must stay on core. There are two ways of being taken off core ahead of time: Blocking for a mutex, and blocking for network. Lots of short messages run into scheduling overhead, since the recipient must decide what to do with each, which is not really possible without some sort of critical section. This is more efficient if messages are longer, as the decision time does not depend on message length. Longer messages are however liable to block on write at the sender side. So one pays in either case. This is another tradeoff to be balanced.

  • Flow control — A query is a pipeline of producers and consumers. Sometimes the consumer is in a different partition. The producer must not get indefinitely ahead of the consumer because this would run out of memory, but it must stay sufficiently ahead so as not to stop the consumer. In practice, there are synchronization barriers to check even progress. These will decrease platform utilization, because two threads never finish at exactly the same time. The price of not having these is having no cap on transient memory consumption.

  • Inhomogeneous performance — Identical machines do not always perform identically. This is seen especially with disk, where wear on SSDs can affect write speed, and where uncontrollable hazards of data placement give uneven read speeds on rotating media. Purely memory-bound performance is quite close, though. Unpredictable and uncontrollable hazards of scheduling cause network messages to arrive at different times, which introduces variation in run time on consecutive runs. Single servers have some such variation from threading, but the effects are larger with a network.
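
The hash table arithmetic in the distributed memory item can be spelled out as below. This is a sketch under the simplifying assumption that total RAM is sized to the working set, so that a node's RAM and its share of the working set are the same fraction; the variable names are ours.

```python
from fractions import Fraction

nodes = 16
node_share = Fraction(1, nodes)  # each machine holds 1/16 of the total
hash_table = Fraction(1, 64)     # hash table size relative to the whole working set

# Replicated: every node builds a full copy of the hash table.
replicated_per_node = hash_table / node_share

# Partitioned: every node holds only its 1/16 slice of the hash table.
partitioned_per_node = (hash_table / nodes) / node_share

print(replicated_per_node)   # 1/4 of the node's share
print(partitioned_per_node)  # 1/64 of the node's share
```

Partitioning brings the per-node footprint back to the single-server proportion, at the price of cross-partition probes.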

The logical side of query optimization stays the same. Pushing down predicates is always good, and all the logical tricks with moving conditions between subqueries stay the same.

Schema design stays much the same, but there is the extra question of partitioning keys. In this implementation, there are only indices on identifiers, not on dates, for example. So, for a primary key to foreign key join, if there is an index on the foreign key, the index should be partitioned the same way as the primary key. So, joining from orders to lineitem on orderkey will be co-located. Joining from customer to orders by index will be co-located for the c_custkey = o_custkey part (assuming an index on o_custkey) and cross-partition for getting the orders row on o_orderkey, supposing that the query needs some property of the order other than o_custkey or o_orderkey.
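
A toy model may make the co-location rules concrete. Everything in it is hypothetical: the partition count, the modulo partition function, and the sample keys are ours, and the model assumes that each row lives in the partition of its primary key while the index on o_custkey is partitioned like c_custkey.

```python
N_PARTITIONS = 4  # hypothetical cluster size


def partition_of(key, n=N_PARTITIONS):
    # Placeholder partition function; the actual one is not described here.
    return key % n


# One hypothetical order with one lineitem, and its customer.
order = {"o_orderkey": 7, "o_custkey": 10}
lineitem = {"l_orderkey": 7}
customer = {"c_custkey": 10}

# Rows live where their partitioning key maps.
order_row_at = partition_of(order["o_orderkey"])
lineitem_row_at = partition_of(lineitem["l_orderkey"])
customer_row_at = partition_of(customer["c_custkey"])
# The foreign key index on o_custkey is partitioned the same way as c_custkey.
custkey_index_entry_at = partition_of(order["o_custkey"])

print("orders -> lineitem co-located:", order_row_at == lineitem_row_at)
print("customer -> o_custkey index co-located:", customer_row_at == custkey_index_entry_at)
print("index entry -> orders row co-located:", custkey_index_entry_at == order_row_at)
```

With these sample keys the first two lookups stay within one partition, while fetching the orders row from the index entry crosses partitions.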

A secondary question is the partition granularity. For good compression, nearby values should be consecutive, so here we leave the low 12 bits out of the partitioning. This affects bulk load and refreshes: for example, a batch of 10,000 lineitems ordered on l_orderkey will go to only 2 or 3 distinct destinations, giving longer messages and longer insert batches, which is more efficient.
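
The effect of leaving out the low bits can be sketched as follows. The mapping used here (shift, then remainder) is an assumption for illustration, not the actual partition function, and the batch is a hypothetical run of 10,000 consecutive key values rather than real l_orderkey data.

```python
def partition_of(key, n_partitions, skip_bits=12):
    # Assumed mapping: drop the low bits, then take the remainder.
    return (key >> skip_bits) % n_partitions


def destinations(keys, n_partitions, skip_bits):
    """Set of partitions that a batch of keys is sent to."""
    return {partition_of(k, n_partitions, skip_bits) for k in keys}


# Hypothetical batch whose keys span 10,000 consecutive values.
batch_keys = range(1_000_000, 1_010_000)
n = 16  # hypothetical number of partitions

coarse = destinations(batch_keys, n, skip_bits=12)
fine = destinations(batch_keys, n, skip_bits=0)
print(len(coarse), "destinations when the low 12 bits are left out")
print(len(fine), "destinations when every bit takes part")
```

With these assumed numbers the batch touches 3 destinations instead of all 16, so each message carries a few thousand rows rather than a few hundred.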

This is a quick overview of the wisdom so far. In subsequent installments, we will take a quantitative look at the tradeoffs and consider actual queries. As a conclusion, we will show a full run on a couple of different platforms, and likely provide Amazon machine images for the interested to see for themselves. Virtuoso Cluster is not open source, but the cloud will provide easy access.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09/24/2014 13:05 GMT Modified: 10/06/2014 13:58 GMT
In Hoc Signo Vinces (part 16 of n): Introduction to Scale-Out [ Orri Erling ]

09/24/2014 13:05 GMT Modified: 10/06/2014 13:55 GMT