In Hoc Signo Vinces (part 18 of n): Cluster Dynamics

    This article is about how scale-out differs from single-server. It shows the large effects of parameters whose very existence most would not anticipate, and some low-level metrics for assessing them. The moral of the story is that this is the stuff that makes the difference between merely surviving scale-out and winning with it. The developer and DBA would not normally know about any of this; these things therefore fall into the category of adaptive self-configuration expected from the DBMS. But since this series is about what makes performance, I will discuss the dynamics as they are and how to play them.

    We take the prototypical cross-partition join in Q13: Make a hash table of all customers, partitioned by c_custkey; this is done independently and with full parallelism in each partition. Scan orders, get the customer (in a different partition), and flag each customer that had at least one order. Then, to get the customers with no orders, return the customers that were not flagged in the previous pass.
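
    For reference, the standard TPC-H text of Q13 is roughly as follows ('special' and 'requests' being the usual example substitution parameters); the flagging pass described above implements the left outer join:

    SELECT c_count, COUNT (*) AS custdist
      FROM ( SELECT c_custkey, COUNT (o_orderkey) AS c_count
               FROM customer
                 LEFT OUTER JOIN orders ON c_custkey = o_custkey
                    AND o_comment NOT LIKE '%special%requests%'
              GROUP BY c_custkey
           ) c_orders
     GROUP BY c_count
     ORDER BY custdist DESC, c_count DESC;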

    The single-server times in part 12 were 7.8 s and 6.0 s with a single user; we consider the better of the two. The difference is due to allocating memory on the first run; on the second, the memory is already in reserve.

    With default settings, we get 4595 ms, with per-node resource utilization at:

    
    Cluster 4 nodes, 4 s. 112405 m/s 742602 KB/s  2749% cpu 0%  read 4% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
    cl 1: 27867 m/s 185654 KB/s  733% cpu 0%  read 4% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
    cl 2: 28149 m/s 185372 KB/s  672% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
    cl 3: 28220 m/s 185621 KB/s  675% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
    cl 4: 28150 m/s 185837 KB/s  667% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs
    
     

    The top line is the summary; the lines below are per-process. m/s is messages per second; KB/s is interconnect traffic per second; clw % is the percentage of time spent idle waiting for a reply from another process. The cluster is set up with 4 processes across 2 machines, each machine with 2 NUMA nodes. Each process has affinity to one NUMA node, so all memory is local. The time is reasonable in light of the overall CPU of 2749%; the maximum would be 4800%, with all threads of all cores busy all the time.
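
    A report like the one above can be requested at any time. A minimal sketch, assuming the status () administration function accepts the 'cluster' option as in recent Virtuoso cluster builds:

    -- From an isql session: print the cluster-wide summary line
    -- and one line per process, as quoted above.
    status ('cluster');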

    The catch here is that we do not get a steady half-platform utilization all the time, but rather full-platform peaks followed by synchronization barriers with very low utilization. So we set the batch size differently:

    
    cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 50000000)');
    
     

    This sets cl_dfg_batch_bytes on each process to 50M, from the default of 10M. The effect is that each scan of orders (one thread per slice, 48 slices in total) will produce 50MB worth of o_custkeys to be sent to the other partitions for getting the customer. After each 50MB, the thread stops, producing its next batch only when all threads are done and the coordinator sends a global continue message.
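
    The same call with the old value undoes the change; cl_exec () runs the statement on every process:

    -- Back to the 10MB default batch.
    cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 10000000)');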

    The time is now 3173 ms with:

    
    Cluster 4 nodes, 3 s. 158220 m/s 1054944 KB/s  3676% cpu 0%  read 1% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
    cl 1: 39594 m/s 263962 KB/s  947% cpu 0%  read 1% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
    cl 2: 39531 m/s 263476 KB/s  894% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
    cl 3: 39523 m/s 263684 KB/s  933% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
    cl 4: 39535 m/s 263586 KB/s  900% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs
    
     

    Platform utilization is clearly better. The throughput is nearly double that of the single server, which is quite good for a communication-heavy query.

    This was done with a vector size of 10K. In other words, each partition gets 10K o_custkeys and splits them 48 ways, one way per recipient: 1/4 stay within the same process, 1/4 go to a different process on the same machine, and 2/4 go to a different machine. Each recipient gets messages with an average of 208 o_custkey values, reassembles these into batches of 10K, and passes them to the hash join with customer.
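
    The per-message average is just the fan-out arithmetic:

    -- A 10K-value vector split over 48 recipient slices:
    SELECT 10000 / 48;   -- 208 o_custkey values per message, on average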

    We try different vector sizes, such as 100K:

    
    cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');
    
     

    There are two metrics of interest here: the write block time and the scheduling overhead. The write block time is a count of microseconds that increases whenever a thread must wait before it can write to a connection. The scheduling overhead is the cumulative clock cycles spent by threads waiting for the critical section that dispatches messages to consumer threads. Long messages cause blocking on write; short messages cause frequent scheduling decisions.

    
    SELECT cl_sys_stat ('local_cll_clk', clr=>1), 
           cl_sys_stat ('write_block_usec', clr=>1)
    ;
    
     

    cl_sys_stat gets the counters from all processes and returns the sum; clr=>1 means that a counter is cleared after it is read.
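
    Putting the pieces together, one measurement cycle per setting looks like this (a sketch; Q13 stands for whatever query is being profiled):

    -- Set the vector size on every process.
    cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');

    -- Clear the counters, discarding whatever has accumulated.
    SELECT cl_sys_stat ('local_cll_clk', clr=>1),
           cl_sys_stat ('write_block_usec', clr=>1);

    -- ... run Q13 ...

    -- Read the mutex clocks and write-block microseconds for the run,
    -- clearing them again for the next cycle.
    SELECT cl_sys_stat ('local_cll_clk', clr=>1),
           cl_sys_stat ('write_block_usec', clr=>1);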

    We do Q13 with vector sizes of 10K, 100K, and 1000K:

    Vector size   Time (ms)   mtx (local_cll_clk, clocks)   wblock (write_block_usec, µs)
    10K           3297        10,829,910,329                0
    100K          3150        1,663,238,367                 59,132
    1000K         3876        414,631,129                   4,578,003

    So, 100K seems to strike the best balance between scheduling and blocking on write.

    The times are measured after several runs with each setting. They stabilize after a few runs, once appropriately sized memory blocks are in reserve; calling mmap to allocate these on the first run with each size carries a very high penalty, e.g., 60 s for the first run with a 1M vector size. We note that blocking on write is really bad even though 1/3 of the messages stay on the same machine, with no network at all, and the other 2/3 go over a fast network (QDR InfiniBand) with no other load. Further, the affinities are set so that the thread responsible for incoming messages is always on core. Result variability on consecutive runs is under 5%, which is similar to single-server behavior.

    It would seem that a mutex, bad as it is, is still better than a distributed cause for going off core (blocking on write). The latency for resuming a thread blocked on a write is of course higher than the latency for resuming one waiting on a mutex.

    We note that a cluster with more machines can take a larger vector size, since a vector spreads out over more recipients. The key seems to be to set the message size so that blocking on write stays rare; this is a possible adaptive execution feature. We have seen no particular benefit from SDP (Sockets Direct Protocol) and its zero copy; this is a TCP replacement that comes with the InfiniBand drivers.

    We will next look at replication/partitioning tradeoffs for hash joins. Then we can look at full runs.

    To be continued...

    In Hoc Signo Vinces (TPC-H) Series
