No epic is complete without a descent into hell. Enter the historia calamitatum of the 500 Giga-triples (Gt) at CWI's Scilens cluster.
Now, from last time, we know to generate the data without 10 GB of namespace prefixes per file and with many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified. Now the generation was about 4 days. Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values and skew becomes severe if they all go to the same partition; hence consecutive small INTs each go to a different partition, but for larger ones the low 8 bits are ignored, which is good for compression: Consecutive values must fall in consecutive places, but not for small INTs. Another uniquely brain-dead feature of the BSBM generator has also been rectified: When generating multiple files, the program would put things in files in a round-robin manner, instead of putting consecutive numbers in consecutive places, which is how every other data generator or exporter does it. This impacts bulk load locality and as you, dear reader, ought to know by now, performance comes from (1) locality and (2) parallelism.
The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock than last time; anyway, a different set of machines.
The first experiment is with triples, so no characteristic sets, no schema.
So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with malloc, whether in small or large pieces, but it can with mmap. Ain't seen that before. One day shot. Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes. Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigE's. So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; does not even respond to ping. We change this to another. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data are in. But now it takes 2h21min to make a checkpoint, i.e., make the loaded data durable on disk. One of the boxes manages to write 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk, seen such before. The data can however be read back once the write is finally done.
Well, this is a non-starter. So, by mid-day of day 4, another machine has been replaced. Now writing to disk is possible within expected delays.
In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.
In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on POSG for triples were the predicate is rdf:type and the object is a common type. Use of characteristic sets will stop this nonsense.
But let us not get ahead of the facts: At 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.
A previous scale model experiment of loading 37.5 Gt in 192 GB of RAM, paging to a pair of 2 TB disks, has been done a week before. This finishes in time, keeping a load rate of above 400 Kt/s on a 12-core box.
At 10AM on day 5 (Friday), the cluster is rebooted; a whole night's run missed. The cluster starts and takes about 30 minutes to get to its former 5 Mt/s load rate. We now try switching the network back to InfiniBand. The whole ethernet network seemed to have crashed at 9PM on day 4. This is of course unexplained but the experiment had been driving the ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will now try IB rather than risk this happening again, especially since if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).
So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not full platform since there is load unbalance as noted above.
At 9PM on day 5, the rate is around 5.7 Mt/s with the peak node at 1500% CPU out of a possible 1600%. The next one is under 800%, which is just to show what it means to index everything. In specific, the node that has the highest CPU is the one in whose partition the bsbm:offer class falls, so that there is a local peak since one of every 9 or so triples says that something is an offer. The stupidity of the triple store is to index garbage like this to begin with. The reason why the performance is still good is that a POSG index where P and O are fixed and the S is densely ascending is very good, with everything but the S represented as run lengths and the S as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.
The journey consists of 3 different parts. At 10PM, the 3rd and last part is started. The triples have more literals, but the load is more even. The cruising speed is 4.3 Mt/s down from 6.2, but the data has a different shape, including more literals.
The last stretch of the data is about reviews. This stretch of the data has less skew. So we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.
At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely go in under 24h. The average rate for a 4 hour sample between midnight and 4AM of day 6 is 6.8 MT/s. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.
At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions makes a large number of sometimes long messages which makes many mmaps. So we will wait until morning of day 8 (Monday) for the administrator to set these. In the meantime, we analyze the behavior of the workload on the 37 Gt scale model cluster on my desktop.
To be continued...
About this entry:
Author: Virtuso Data Space Bot
Published: 08/18/2014 16:55 GMT-0500
09/06/2014 20:48 GMT-0500
Comment Status: 0 Comments