About: LOD2 Finale (part 2 of n): The 500 Giga-triples


An Entity of Type : schema:BlogPosting, within Data Space : www.openlinksw.com associated with source document(s)

http://www.openlinksw.com/describe/?url=http%3A%2F%2Fwww.openlinksw.com%2Fdataspace%2Fvdb%2Fweblog%2Fvdb%2527s%2520BLOG%2520%255B136%255D%2F1807&graph=http%3A%2F%2Fwww.openlinksw.com%2Fdataspace&graph=http%3A%2F%2Fwww.openlinksw.com%2Fdataspace

Attributes	Values
has container	vdb's BLOG [136] description
Date Created	2014-08-18 16:55:57.858644-04:00(dt:dateTime)
maker	Virtuso Data Space Bot
Date Modified	2014-09-06 20:48:58.008009-04:00(dt:dateTime)
link	LOD2 Finale (part 2 of n): The 500 Giga-triples
id	921bfa2551c1627e36ae25b49bb1d46d
content	No epic is complete without a descent into hell. Enter the historia calamitatum of the 500 Giga-triples (Gt) at CWI's Scilens cluster.

Now, from last time, we know to generate the data without 10 GB of namespace prefixes per file and with many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified. Generation now took about 4 days.

Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values, and skew becomes severe if they all go to the same partition. Hence consecutive small `INTs` each go to a different partition, while for larger ones the low 8 bits are ignored, which is good for compression: consecutive values must fall in consecutive places, except for small `INTs`.

Another uniquely brain-dead feature of the BSBM generator has also been rectified: when generating multiple files, the program would put things in files in a round-robin manner, instead of putting consecutive numbers in consecutive places, which is how every other data generator or exporter does it. This impacts bulk load locality and, as you, dear reader, ought to know by now, performance comes from (1) locality and (2) parallelism.

The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock than last time; anyway, a different set of machines.

The first experiment is with triples, so no characteristic sets, no schema.

So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with `malloc`, whether in small or large pieces, but it can with `mmap`. Ain't seen that before. One day shot.

Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes.
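
The partitioning rule for integer keys described above can be sketched as follows. This is only an illustration under stated assumptions, not Virtuoso's actual code: the name `partition_of`, the cutoff `SMALL_LIMIT` for what counts as a "small" integer, and the use of a plain modulo as the hash are all hypothetical. Only the two behaviors come from the text: small consecutive values are spread out, and larger values ignore their low 8 bits.

```python
# Hypothetical cutoff below which a key counts as "small"; the post does not give one.
SMALL_LIMIT = 256


def partition_of(key: int, n_partitions: int) -> int:
    """Map a non-negative integer partition key to a partition number.

    Small keys: consecutive values go to different partitions, so very
    common values such as 1 and 2 do not pile up in one place.
    Larger keys: the low 8 bits are ignored, so a run of 256 consecutive
    values lands in the same partition, which keeps neighbours together.
    """
    if key < SMALL_LIMIT:
        return key % n_partitions
    return (key >> 8) % n_partitions


if __name__ == "__main__":
    n = 12  # one partition per machine, purely for illustration
    print([partition_of(k, n) for k in (1, 2, 3)])
    print({partition_of(k, n) for k in range(1024, 1280)})
    print(partition_of(1280, n))
```

With 12 partitions, keys 1, 2 and 3 each get their own partition, while keys 1024 through 1279 all share one and key 1280 moves on to the next.
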
Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigEs.

So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; it does not even respond to ping. We swap it for another. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data are in.

But now it takes 2h21min to make a checkpoint, i.e., to make the loaded data durable on disk. One of the boxes manages to write 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk, seen such before. The data can however be read back once the write is finally done.

Well, this is a non-starter. So, by mid-day of day 4, another machine has been replaced. Now writing to disk is possible within expected delays.

In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.

In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on `POSG` for triples where the predicate is `rdf:type` and the object is a common type. Use of characteristic sets will stop this nonsense.

But let us not get ahead of the facts: at 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.

A previous scale-model experiment of loading 37.5 Gt in 192 GB of RAM, paging to a pair of 2 TB disks, had been done a week before.
This finished in time, keeping a load rate above 400 Kt/s on a 12-core box.

At 10AM on day 5 (Friday), the cluster is rebooted; a whole night's run missed. The cluster starts and takes about 30 minutes to get to its former 5 Mt/s load rate.

We now try switching the network back to InfiniBand. The whole Ethernet network seemed to have crashed at 9PM on day 4. This is of course unexplained, but the experiment had been driving the Ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will now try IB rather than risk this happening again, especially since if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).

So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross-sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not full platform utilization, since there is load imbalance as noted above.

At 9PM on day 5, the rate is around 5.7 Mt/s, with the peak node at 1500% CPU out of a possible 1600%. The next busiest is under 800%, which just goes to show what it means to index everything. Specifically, the node with the highest CPU is the one in whose partition the `bsbm:offer` class falls, so there is a local peak, since one of every 9 or so triples says that something is an `offer`. The stupidity of the triple store is to index garbage like this to begin with. The reason why the performance is still good is that a `POSG` index where `P` and `O` are fixed and the `S` is densely ascending is very good, with everything but the `S` represented as run lengths and the `S` as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.

The journey consists of 3 different parts. At 10PM, the 3rd and last part is started.
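
To make the remark about the `POSG` index more concrete, here is a toy encoding in which a run of rows sharing the same `P` and `O` is stored once with a count, and the densely ascending `S` values are stored as a bitmap relative to the first subject. This is a sketch of the general idea only: the function names, the tuple layout, the use of small integer ids, and the omission of the `G` column are assumptions made for illustration, not Virtuoso's storage format.

```python
from itertools import groupby


def encode_posg(rows):
    """Compress rows of (p, o, s) integer ids, sorted by (p, o, s).

    Each run of equal (p, o) becomes one tuple:
    (p, o, run_length, first_subject, bitmap), where bit i of the
    bitmap is set when subject first_subject + i is present.
    """
    runs = []
    for (p, o), group in groupby(rows, key=lambda r: (r[0], r[1])):
        subjects = [s for _, _, s in group]
        base = subjects[0]
        bitmap = 0
        for s in subjects:
            bitmap |= 1 << (s - base)
        runs.append((p, o, len(subjects), base, bitmap))
    return runs


def decode_posg(runs):
    """Expand the output of encode_posg back into (p, o, s) rows."""
    rows = []
    for p, o, _count, base, bitmap in runs:
        offset = 0
        while bitmap:
            if bitmap & 1:
                rows.append((p, o, base + offset))
            bitmap >>= 1
            offset += 1
    return rows


if __name__ == "__main__":
    # Ten subjects in a row with one common (p, o), then a sparser second run.
    rows = [(1, 7, s) for s in range(100, 110)] + [(1, 8, 200), (1, 8, 202)]
    runs = encode_posg(rows)
    print(runs)
    print(decode_posg(runs) == rows)
```

Twelve rows collapse into two run records here, which is why such an index stays cheap even when it is of no use to any query.
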
The cruising speed is 4.3 Mt/s, down from 6.2, but the data has a different shape: the triples have more literals, though the load is more even.

The last stretch of the data is about reviews and has less skew. So we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.

At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely go in under 24h. The average rate for a 4 hour sample between midnight and 4AM of day 6 is 6.8 Mt/s. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.

At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions makes a large number of sometimes long messages, which makes many `mmaps`. So we will wait until the morning of day 8 (Monday) for the administrator to set these. In the meantime, we analyze the behavior of the workload on the 37 Gt scale model cluster on my desktop.

To be continued...

LOD2 Finale Series
Indexing everything
Having literals and URI strings via dictionary
Having a join for every attribute
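
As a rough check on the figures quoted in the post, the short calculation below converts the reported load rates into wall-clock hours for 490 billion triples and estimates the space used per triple. It assumes decimal units (1 TB = 10^12 bytes) and treats the "about 20%" unallocated share as exactly 0.20; the function and constant names are made up for this sketch.

```python
TOTAL_TRIPLES = 490.0e9        # triple count reported after the load
DB_BYTES = 10.9e12             # database file size, assuming 1 TB = 10**12 bytes
UNALLOCATED_FRACTION = 0.20    # "about 20%" of the volume, taken as exact here


def hours_to_load(triples: float, rate_per_second: float) -> float:
    """Hours needed to load `triples` at a constant rate in triples/second."""
    return triples / rate_per_second / 3600.0


def bytes_per_triple(db_bytes: float, unallocated: float, triples: float) -> float:
    """Allocated bytes per triple, all indices included."""
    return db_bytes * (1.0 - unallocated) / triples


if __name__ == "__main__":
    # Rates mentioned in the post, in triples per second.
    for rate in (4.3e6, 6.2e6, 6.8e6):
        hours = hours_to_load(TOTAL_TRIPLES, rate)
        print(f"{rate / 1e6:.1f} Mt/s -> {hours:.1f} h")
    per_triple = bytes_per_triple(DB_BYTES, UNALLOCATED_FRACTION, TOTAL_TRIPLES)
    print(f"{per_triple:.1f} bytes per triple")
```

At a steady 6.2 Mt/s the whole load comes to roughly 22 hours, consistent with the estimate of under 24h for an uninterrupted run, and the allocated space works out to a little under 18 bytes per triple under the assumptions above.
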
Title	LOD2 Finale (part 2 of n): The 500 Giga-triples
is described using	http://www.openlinksw.com/dataspace/vdb/weblog/vdb%27s%20BLOG%20%5B136%5D/1807/sioc.rdf
atom:source	vdb's BLOG [136] description
atom:updated	2014-09-07T00:48:58Z
atom:title	LOD2 Finale (part 2 of n): The 500 Giga-triples
links to	http://scilens.project.cwi.nl/ http://www.cwi.nl/ http://www.openlinksw.com/weblog/oerling/?id=1767
atom:author	Virtuso Data Space Bot
label	LOD2 Finale (part 2 of n): The 500 Giga-triples
atom:published	2014-08-18T20:55:57Z
http://rdfs.org/si...ices#has_services	http://www.openlinksw.com/dataspace/services/weblog/item
type	Blog Post atom:Entry BlogPosting
