This HTML5 document contains 64 embedded RDF statements represented using HTML+Microdata notation.

The embedded RDF content will be recognized by any HTML5 Microdata processor.

Namespace Prefixes

Prefix    IRI
n11       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/person/iODBC/
n14       http://virtuoso.openlinksw.com:8889/about/id/entity/http/www.openlinksw.com/dataspace/organization/
n24       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/organization/openlink/
n15       http://rdfs.org/sioc/services#
dc        http://purl.org/dc/elements/1.1/
schema    http://schema.org/
n23       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/organization/vdb/
dcterms   http://purl.org/dc/terms/
n7        http://www.openlinksw.com/dataspace/vdb/weblog/vdb%27s%20BLOG%20%5B136%5D/1807/
rdfs      http://www.w3.org/2000/01/rdf-schema#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
n21       http://data.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/organization/
atom      http://atomowl.org/ontologies/atomrdf#
n35       http://ods.openlinksw.com:8889/about/id/entity/http/www.openlinksw.com/dataspace/organization/
xsdh      http://www.w3.org/2001/XMLSchema#
n10       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/services/weblog/
n29       http://www.cwi.nl/
n17       http://scilens.project.cwi.nl/
n13       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/organization/
sioc      http://rdfs.org/sioc/ns#
n2        http://www.openlinksw.com/dataspace/vdb/weblog/vdb%27s%20BLOG%20%5B136%5D/
n39       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/vdb/weblog/
n36       http://www.openlinksw.com/dataspace/vdb/weblog/vdb%27s%20BLOG%20%5B136%5D/1807#
n41       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/person/oerling/
n9        http://www.openlinksw.com/dataspace/organization/vdb#
opl       http://www.openlinksw.com/schema/attribution#
n5        http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/person/
n34       http://virtuoso.openlinksw.com:8889/about/id/entity/http/www.openlinksw.com/dataspace/
n19       https://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/organization/
n42       https://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/vdb/weblog/
n32       http://www.openlinksw.com/dataspace/vdb/weblog/vdb%27s%20BLOG%20%5B136%5D/1807/page/
foaf      http://xmlns.com/foaf/0.1/
n38       http://virtuoso.openlinksw.com:8889/about/id/entity/http/www.openlinksw.com/dataspace/vdb/weblog/
n37       http://virtuoso.openlinksw.com:8889/about/id/entity/http/www.openlinksw.com/dataspace/services/weblog/
sioct     http://rdfs.org/sioc/types#
wdrs      http://www.w3.org/2007/05/powder-s#
n22       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/organization/dav/
n28       http://virtuoso.openlinksw.com:8889/about/id/entity/http/www.openlinksw.com/dataspace/person/
n16       http://www.openlinksw.com/dataspace/services/weblog/
n20       http://www.openlinksw.com/weblog/oerling/
n27       http://www.openlinksw.com/dataspace/vdb/weblog/
n18       http://www.openlinksw.com/about/id/entity/http/www.openlinksw.com/dataspace/
n33       http://www.openlinksw.com/dataspace/vdb#

Statements

Subject Item
n2:1807
sioc:has_container
n27:vdb%27s%20BLOG%20%5B136%5D
dcterms:created
2014-08-18T16:55:57.858644-04:00
foaf:maker
n9:this
foaf:topic
n27:vdb%27s%20BLOG%20%5B136%5D n33:this n36:this n16:item n32:1 n9:this
wdrs:describedby
n5:borislav n5:hwilliams n10:item n11:about.rdf n13:dav n14:vdb n13:openlink n18:vdb n19:openlink n5:netrista n21:openlink n22:sioc.rdf n23:about.rdf n24:sioc.rdf n24:foaf.rdf n28:jeep1688 n13:uda n5:jeep1688 n13:vdb n5:dba n34:vdb n35:openlink n23:sioc.rdf n5:upstream n5:openlink n5:iODBC n37:item n38:vdb%27s%20BLOG%20%5B136%5D n39:vdb%27s%20BLOG%20%5B136%5D n5:oat n41:about.rdf n5:tutorial_demo n5:oerling n42:vdb%27s%20BLOG%20%5B136%5D
rdfs:seeAlso
n32:1
dcterms:modified
2014-09-06T20:48:58.008009-04:00
sioc:link
n2:1807
sioc:id
921bfa2551c1627e36ae25b49bb1d46d
sioc:content
<p>No epic is complete without a descent into hell. Enter the <i>historia calamitatum</i> of the 500 Giga-triples (Gt) at <a href="http://www.cwi.nl/">CWI</a>'s <a href="http://scilens.project.cwi.nl/">Scilens</a> cluster.</p>
<p>Now, <a href="http://www.openlinksw.com/weblog/oerling/?id=1767">from last time</a>, we know to generate the data without 10 GB of namespace prefixes per file and with many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified; generation now took about 4 days. Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values, and skew becomes severe if they all go to the same partition; hence consecutive small <code>INTs</code> each go to a different partition, while for larger values the low 8 bits are ignored, which is good for compression: consecutive values fall in consecutive places, except for small <code>INTs</code>. Another uniquely brain-dead feature of the BSBM generator has also been rectified: when generating multiple files, the program would put things in files in a round-robin manner, instead of putting consecutive numbers in consecutive places, which is how every other data generator or exporter does it. This impacts bulk-load locality, and as you, dear reader, ought to know by now, performance comes from (1) locality and (2) parallelism.</p>
<p>The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock; anyway, a different set of machines.</p>
<p>The first experiment is with triples, so no characteristic sets, no schema.</p>
<p>So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with <code>malloc</code>, whether in small or large pieces, but it can with <code>mmap</code>. Ain't seen that before. One day shot. Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes. Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigEs. So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; it does not even respond to ping. We swap it for another. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data are in. But now it takes 2h21min to make a checkpoint, i.e., make the loaded data durable on disk. One of the boxes manages to write 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk, seen such before. The data can however be read back once the write is finally done.</p>
<p>Well, this is a non-starter. So, by mid-day of day 4, another machine has been replaced. Now writing to disk is possible within expected delays.</p>
<p>In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.</p>
<p>In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on <code>POSG</code> for triples where the predicate is <code>rdf:type</code> and the object is a common type. Use of characteristic sets will stop this nonsense.</p>
<p>But let us not get ahead of the facts: at 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.</p>
<p>A previous scale-model experiment of loading 37.5 Gt in 192 GB of RAM, paging to a pair of 2 TB disks, had been done a week before. It finished in time, keeping a load rate of above 400 Kt/s on a 12-core box.</p>
<p>At 10AM on day 5 (Friday), the cluster is rebooted; a whole night's run missed. The cluster starts and takes about 30 minutes to get to its former 5 Mt/s load rate. We now try switching the network back to InfiniBand. The whole Ethernet network seems to have crashed at 9PM on day 4. This is of course unexplained, but the experiment had been driving the Ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will now try IB rather than risk this happening again, especially since if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).</p>
<p>So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross-sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not the full platform, since there is load imbalance as noted above.</p>
<p>At 9PM on day 5, the rate is around 5.7 Mt/s, with the peak node at 1500% CPU out of a possible 1600%. The next one is under 800%, which is just to show what it means to index everything. Specifically, the node with the highest CPU is the one in whose partition the <code>bsbm:offer</code> class falls, so there is a local peak, since one of every 9 or so triples says that something is an <code>offer</code>. The stupidity of the triple store is to index garbage like this to begin with. The reason performance is still good is that a <code>POSG</code> index where <code>P</code> and <code>O</code> are fixed and the <code>S</code> is densely ascending is very efficient, with everything but the <code>S</code> represented as run lengths and the <code>S</code> as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.</p>
<p>The journey consists of 3 different parts. At 10PM, the 3rd and last part is started. The triples have more literals, but the load is more even. The cruising speed is 4.3 Mt/s, down from 6.2, as the data has a different shape.</p>
<p>The last stretch of the data is about reviews. This stretch has less skew, so we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.</p>
<p>At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely go in under 24h. The average rate for a 4-hour sample between midnight and 4AM of day 6 is 6.8 Mt/s. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.</p>
<p>At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions makes for a large number of sometimes long messages, which means many <code>mmaps</code>. So we will wait until the morning of day 8 (Monday) for the administrator to adjust these settings. In the meantime, we analyze the behavior of the workload on the 37 Gt scale-model cluster on my desktop.</p>
<p>To be continued...</p>
<h3>LOD2 Finale Series</h3> <ul> <li> <a>Indexing everything</a> </li> <li> Having literals and URI strings via dictionary </li> <li> Having a join for every attribute </li> </ul>
dc:title
LOD2 Finale (part 2 of n): The 500 Giga-triples
sioc:has_creator
n33:this
opl:isDescribedUsing
n7:sioc.rdf
atom:source
n27:vdb%27s%20BLOG%20%5B136%5D
atom:updated
2014-09-07T00:48:58Z
atom:title
LOD2 Finale (part 2 of n): The 500 Giga-triples
sioc:links_to
n17: n20:?id=1767 n29:
atom:author
n9:this
rdfs:label
LOD2 Finale (part 2 of n): The 500 Giga-triples
atom:published
2014-08-18T20:55:57Z
n15:has_services
n16:item
rdf:type
schema:BlogPosting sioct:BlogPost atom:Entry
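
The partition-key rule described in the embedded post (consecutive small integers spread across partitions to break the skew caused by very common values such as 1 and 2, while larger keys are partitioned with their low 8 bits ignored so that consecutive values stay together and compress well) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Virtuoso's actual code; the 256 cutoff and the function name are invented for the example.

/* Rough sketch of the small-vs-large partition-key rule from the post above.
   The cutoff (256) and names are illustrative assumptions, not Virtuoso source. */
#include <stdint.h>

#define SMALL_KEY_LIMIT 256  /* assumed cutoff for "small" partition keys */

uint32_t
partition_of (uint64_t key, uint32_t n_partitions)
{
  if (key < SMALL_KEY_LIMIT)
    return (uint32_t) (key % n_partitions);       /* consecutive small, common values land on different partitions */
  return (uint32_t) ((key >> 8) % n_partitions);  /* drop low 8 bits: runs of 256 consecutive keys stay together */
}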
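
The 9 GB allocation problem from day 1 (malloc refusing a large block where an anonymous mmap still succeeds) can be worked around along the lines below. This is a sketch assuming Linux, not the actual fix used in the experiment; note that a block obtained this way must be released with munmap rather than free, so the caller has to remember which path was taken.

/* Minimal sketch: fall back to an anonymous mmap when malloc cannot supply
   a very large block.  Linux-specific flags; illustrative only. */
#include <stdlib.h>
#include <sys/mman.h>

void *
alloc_large (size_t bytes)
{
  void *p = malloc (bytes);
  if (p != NULL)
    return p;
  p = mmap (NULL, bytes, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return (p == MAP_FAILED) ? NULL : p;
}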
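
The remark that a POSG index stays cheap when P and O are fixed and S ascends densely (everything but S as run lengths, S as bitmaps) corresponds roughly to a structure like the one below. This is a generic illustration of the run-length-plus-bitmap idea, not Virtuoso's actual on-disk layout; the chunk size and names are assumptions.

/* Generic illustration: P and O are stored once per chunk (a run), and densely
   ascending subject ids are kept as bits relative to a base id, so each extra
   subject costs about one bit.  Not Virtuoso's actual storage format. */
#include <stdint.h>

#define CHUNK_BITS 8192                /* subjects covered per chunk (illustrative) */

typedef struct posg_chunk_s
{
  uint64_t p;                          /* predicate id, shared by the whole chunk */
  uint64_t o;                          /* object id, shared by the whole chunk */
  uint64_t s_base;                     /* first subject id covered by the bitmap */
  uint8_t  s_bits[CHUNK_BITS / 8];     /* bit n set => triple (s_base + n, p, o) present */
} posg_chunk_t;

int
posg_chunk_set (posg_chunk_t * c, uint64_t s)
{
  uint64_t off = s - c->s_base;
  if (off >= CHUNK_BITS)
    return 0;                          /* subject falls outside this chunk */
  c->s_bits[off >> 3] |= (uint8_t) (1 << (off & 7));
  return 1;
}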