<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/">
  <rss:title>Orri Erling&#39;s Weblog</rss:title>
  <rss:link>http://www.openlinksw.com/weblog/oerling/</rss:link>
  <rss:description />
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">oerling@openlinksw.com</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2013-05-18T22:02:10Z</dc:date>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1691" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1689" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1684" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1683" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1682" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-10#1678" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-10#1677" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-09#1675" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-09#1673" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1671" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1669" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1667" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-04#1665" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-03-02#1663" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-02-28#1660" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-02-28#1658" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2011-01-19#1649" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2010-09-21#1631" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2010-09-21#1630" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2010-03-15#1614" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-10-27#1585" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-09-01#1576" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-06-29#1562" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-03#1471" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-08-27#1422" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-08-25#1418" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-07-30#1400" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-05-09#1358" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-03-06#1321" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-02-04#1308" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1304" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1691">
  <rss:title>Transaction Semantics in RDF and Relational Models</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T23:55:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">As a part of defining benchmark audit for testing ACID properties on RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and implications as concern locking (pessimistic) and multi-version (optimistic) concurrency control schemes. In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues. Insert (Create) and Delete INSERT and DELETE as defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the DELETE FROM table WHERE key IN ( SELECT key1 FROM other_table ) expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table. The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples. Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. 
Thus the end state would either have no triples with subject 1 or would have the three just inserted. Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE. We would not get the desired end-state of either all the inserted triples or no triples with subject 1 if the read or the DELETE were not serializable. Furthermore if a DELETE template produced a triple that did not exist in the pre-image, the DELETE semantics still imply that this also does not exist in the after-image, which implies serializability. Read and Update Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted. The initial state is a balance 10 b balance 10 We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9. A relational database needs REPEATABLE READ isolation for this. With RDF, txn1 reads that a has a balance of 10. At the same time, txn2 attempts to read the balance of a. txn2 waits because the read of txn1 is exclusive. txn1 proceeds and reads the balance of b. It then updates the balance of a and b. All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. 
This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF&#39;s schema-less model. We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in a RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case. Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }. locking order and OLTP To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but still may get fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock. This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. 
There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further. RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantic one expects of a relational row. I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this. It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes. Pessimistic Vs. Optimistic Concurrency Control We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. 
The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation. To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting. Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough. The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made. 
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes. In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization. Eventual Consistency Web scale systems that need to maintain consistent state across multiple data centers sometimes use &quot;eventual consistency&quot; schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect. Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application&#39;s viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record. Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed (i.e., DELETEd + INSERTed quads, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. 
When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS. As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels. Conclusions We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same. Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>As a part of defining benchmark audit for testing <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x1eae93c0">ACID</a> properties on <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f0adbe0">RDF</a> stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results.  In so doing, we consider common implementation techniques and implications as concern locking (pessimistic) and multi-version (optimistic) concurrency control schemes.</p>

<p>In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads.  We will use numbers for IRIs and literals.  In most implementations, the internal representation for these is indeed a number (or at least some <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1ea79630">data</a> type that has a well defined collation order).  For ease of presentation, we consider a single index with key parts <code>SPO</code>.  Any other index-like setting with any possible key order will have similar issues. </p>

<h2>Insert (Create) and Delete </h2>

<p>
<code>INSERT</code> and <code>DELETE</code> as defined in <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1f335c60">SPARQL</a> are queries which generate a result set which is then used for instantiating triple patterns.  We note that a <code>DELETE</code> may delete a triple which the <code>DELETE</code> has not read; thus the delete set is not a subset of the read set.  The <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x1e99b4b8">SQL</a> equivalent is the </p>

<blockquote>
 <code><pre>DELETE FROM table WHERE key IN 
   ( SELECT key1 FROM other_table )</pre>
 </code>
</blockquote>

<p>expression, supposing it were implemented as a scan of <code>other_table</code> and an index lookup followed by <code>DELETE</code> on table. </p>

<p>The meaning of <code>INSERT</code> is that the triples in question exist after the operation, and the meaning of <code>DELETE</code> is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples. </p>

<p>Suppose that the triples <code>{ 1 0 0 }</code>, <code>{ 1 5 6 }</code>, and <code>{ 1 5 7 }</code> exist in the beginning. If we <code>DELETE { 1 ?x ?y }</code> and concurrently <code>INSERT { 1 2 4 . 1 2 3 . 1 3 5 }</code>, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that.  Thus the end state would either have no triples with subject <code>1</code> or would have the three just inserted. </p>

<p>Suppose the <code>INSERT</code> inserts the first triple, <code>{ 1 2 4 }</code>.  The <code>DELETE</code> at the same time reads all triples with subject <code>1</code>.  The exclusive read waits for the uncommitted <code>INSERT</code>.  The <code>INSERT</code> then inserts the second triple, <code>{ 1 2 3 }</code>. Depending on the isolation of the read, this either succeeds, since no <code>{ 1 2 3 }</code> was read, or causes a deadlock.  The first corresponds to <code>REPEATABLE READ</code> isolation; the second to <code>SERIALIZABLE</code>.</p>

<p>We would not get the desired end-state of either <i>all the inserted triples</i> or <i>no triples with subject <code>1</code></i> if the read or the <code>DELETE</code> were not serializable.</p>

<p>Furthermore if a <code>DELETE</code> template produced a triple that did not exist in the pre-image, the <code>DELETE</code> semantics still imply that this also does not exist in the after-image, which implies serializability.</p>
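<p>The interleaving above can be replayed as a small deterministic simulation. This is only a sketch under stated assumptions: the store is a plain Python set of tuples, the <code>DELETE</code> is assumed to scan subject <code>1</code> in <code>SPO</code> key order with a cursor, and the function name <code>run_interleaving</code> and its <code>range_locked</code> flag are hypothetical names introduced for illustration, not part of any real system.</p>

```python
def run_interleaving(range_locked):
    """Replay the INSERT/DELETE interleaving from the text, step by step."""
    store = {(1, 0, 0), (1, 5, 6), (1, 5, 7)}

    # The INSERT transaction writes its first triple; the row stays locked.
    store.add((1, 2, 4))

    # The DELETE scans subject 1 in key order. It has passed (1, 0, 0)
    # and now waits on the uncommitted row (1, 2, 4).
    scanned = [(1, 0, 0)]
    cursor = (1, 2, 4)

    # The INSERT next wants (1, 2, 3), which sorts before the scan cursor.
    if range_locked:
        # SERIALIZABLE (assumed range locks): the scanned range is locked,
        # so the INSERT waits on the DELETE while the DELETE waits on it.
        return "deadlock"

    # REPEATABLE READ (assumed row locks only): nothing protects the gap.
    store.add((1, 2, 3))
    store.add((1, 3, 5))

    # The INSERT commits; the DELETE resumes at its cursor and removes
    # what it has seen plus everything from the cursor onward.
    to_delete = scanned + sorted(t for t in store if t >= cursor)
    store -= set(to_delete)
    return store


leftover = run_interleaving(range_locked=False)
outcome = run_interleaving(range_locked=True)
```

<p>Under these assumptions the row-lock run leaves exactly one triple with subject <code>1</code> behind, which is neither of the two serial outcomes, while the range-lock run ends in the deadlock mentioned above.</p>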


<h2>Read and Update</h2>

<p>Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.</p>

<p>The initial state is </p>

<blockquote>
<code><pre>a  balance  10
b  balance  10</pre></code>
</blockquote>

<p>We transfer <code>1</code> from <code>a</code> to <code>b</code>, and at the same time transfer <code>2</code> from <code>b</code> to <code>a</code>.  The end state must have <code>a</code> at <code>11</code> and <code>b</code> at <code>9</code>.</p>

<p>A relational database needs <code>REPEATABLE READ</code> isolation for this.</p>

<p>With RDF, <code>txn1</code> reads that <code>a</code> has a <code>balance</code> of <code>10</code>.   At the same time, <code>txn2</code> attempts to read the <code>balance</code> of <code>a</code>.  <code>txn2</code> waits because the read of <code>txn1</code> is exclusive.  <code>txn1</code> proceeds and reads the <code>balance</code> of <code>b</code>.  It then updates the <code>balance</code> of <code>a</code> and <code>b</code>. </p>

<p>All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of <code>a</code>, since RDF does not really have an update-in-place, consists of deleting <code>{ a balance 10 }</code> and inserting <code>{ a balance 9 }</code>.  This gets done and <code>txn1</code> commits. At this point, <code>txn2</code> proceeds after its wait on the row that stated <code>{ a balance 10 }</code>.  This row is now gone, and <code>txn2</code> sees that <code>a</code> has no balance, which is quite possible in RDF&#39;s <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x1c933cf0">schema</a>-less model.</p>

<p>We see that <code>REPEATABLE READ</code> is not adequate with RDF, even though it is with relational. The reason why there is no <code>UPDATE</code>-in-place is that the <code>PRIMARY KEY</code> of the triple includes all the parts, including the object. Even in a <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1f3cc3c8">RDBMS</a>, an <code>UPDATE</code> of a primary key part amounts to a <code>DELETE</code>-plus-<code>INSERT</code>.  One could here argue that an implementation might still <code>UPDATE</code>-in-place if the key order were not changed.  This would resolve the special case of the accounts but not a more general case.</p>

<p>Thus we see that the read of the balance must be <code>SERIALIZABLE</code>.  This means that the read locks the space before the first balance, so that no insertion may take place.  In this way the read of <code>txn2</code> waits on the lock that is conceptually before the first possible match of <code>{ a balance ?x }</code>.</p>
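<p>The difference between waiting on a row and waiting on the range can be illustrated with a toy model. The sketch below assumes a set of tuples as the store; <code>update_balance</code>, <code>read_balance</code>, and the <code>waited_row</code> parameter are hypothetical helpers that only mimic what a reader would see after its wait ends, not an actual lock manager.</p>

```python
def update_balance(store, account, old, new):
    # RDF has no update-in-place: delete the old triple, insert the new one.
    store.remove((account, "balance", old))
    store.add((account, "balance", new))


def read_balance(store, account, waited_row=None):
    """Return the balance the reader sees once its wait is over."""
    if waited_row is not None:
        # Row-level lock: the reader resumes on the exact row it waited for.
        return waited_row[2] if waited_row in store else None
    # Range lock: the reader seeks again from the start of the key range
    # { account balance ?x } and finds whatever is there now.
    matches = sorted(t for t in store if t[:2] == (account, "balance"))
    return matches[0][2] if matches else None


store = {("a", "balance", 10), ("b", "balance", 10)}

# txn1 transfers 1 from a to b and commits.
update_balance(store, "a", 10, 9)
update_balance(store, "b", 10, 11)

# What txn2 sees, depending on what it was waiting on.
row_lock_view = read_balance(store, "a", waited_row=("a", "balance", 10))
range_lock_view = read_balance(store, "a")

# With the range-lock view, txn2 can transfer 2 from b to a.
a_now = read_balance(store, "a")
b_now = read_balance(store, "b")
update_balance(store, "b", b_now, b_now - 2)
update_balance(store, "a", a_now, a_now + 2)
```

<p>In this model the row-lock reader finds no balance at all, while the range-lock reader sees <code>9</code> and the two transfers end with <code>a</code> at <code>11</code> and <code>b</code> at <code>9</code>.</p>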


<h2>Locking Order and OLTP </h2>

<p>To implement <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1e20f2e8">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/C%2B%2B" id="link-id0x1fa46718">C</a>, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality.  In this way, the locks with the highest likelihood for contention are held for the least time.  If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first.  In this way, the workload would not deadlock.  In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, though they may become less frequent.  Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.</p>
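<p>The rule of locking rows in a deterministic key order can be sketched with ordinary threads. This is a minimal illustration, assuming one in-process lock per account; the <code>locks</code> and <code>balances</code> dictionaries and the <code>transfer</code> function are hypothetical and stand in for what a DBMS lock manager would do.</p>

```python
import threading

# One lock per row; a stand-in for row-level locks in a DBMS.
locks = {key: threading.Lock() for key in ("a", "b")}
balances = {"a": 10, "b": 10}


def transfer(src, dst, amount):
    # Always acquire locks lowest key first, whatever the transfer direction.
    first, second = sorted((src, dst))
    with locks[first], locks[second]:
        balances[src] -= amount
        balances[dst] += amount


# Two opposite transfers run concurrently, as in the accounts example.
threads = [
    threading.Thread(target=transfer, args=("a", "b", 1)),
    threading.Thread(target=transfer, args=("b", "a", 2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

<p>Because both threads take the lock on <code>a</code> before the lock on <code>b</code>, neither can hold one lock while waiting for the other in the opposite order, so this particular workload cannot deadlock.</p>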

<p>This is the conventional relational view of the matter.  In more recent times, in-memory schemes with deterministic lock acquisition (<a href="http://cs-www.cs.yale.edu/homes/dna/papers/determinism-vldb10.pdf" id="link-id0x1c5d9340">Abadi VLDB 2010</a>) or single-threaded atomic execution of transactions (<a href="http://bird.cs.tu-berlin.de:8008/birte2010/" id="link-id0x1ec0ed18">Uni Munich BIRTE workshop at VLDB2010</a>, <a href="http://www.voltdb.com/" id="link-id0x1ab6e380">VoltDB</a>) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations.  These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.</p>

<p>RDBMS usually implement row-level locking.  This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row.  This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantics one expects of a relational row.  </p>

<p>I would argue that it is not essential to enforce transactional guarantees in units of rows.  The guarantees must apply between data that is <i>read</i> and <i>written</i> by a transaction.  It does not need to apply to columns that the transaction does not reference.  To take the TPC-C example, the <i>new order</i> transaction updates the stock level and the <i>delivery</i> transaction updates the delivery count on the stock table. In practice, a <i>delivery</i> and a <i>new order</i> falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.</p>

<p>It does not seem <i>a priori</i> necessary to recreate the row as a unit of concurrency control in RDF.  One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.</p>


<h2>Pessimistic Vs. Optimistic Concurrency Control </h2>

<p>We have so far spoken only in terms of row-level locking, which is to my <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x1f1230a0">knowledge</a> the most widely used model in RDBMS, and one we implement ourselves.  Some databases (e.g., <a class="auto-href" href="http://dbpedia.org/resource/MonetDB" id="link-id0x173a5538">MonetDB</a> and <a class="auto-href" href="http://www.ingres.com/vectorwise/" id="link-id0x16feb008">VectorWise</a>) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable.  Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.  </p>
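<p>The commit-time check described above can be sketched as follows. This is a simplified model, not the implementation of any particular database: the <code>Txn</code> class, the <code>commit</code> function, and the list of active transactions are hypothetical, and versioning and snapshots are left out entirely.</p>

```python
class Txn:
    """A transaction tracked only by its read set and write set."""

    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_set = set()
        self.committable = True


def commit(txn, active):
    """Try to commit txn; mark intersecting transactions un-committable."""
    if not txn.committable:
        return False
    for other in active:
        if other is txn:
            continue
        touched = other.read_set | other.write_set
        # Any overlap with our write set invalidates the other transaction.
        if not txn.write_set.isdisjoint(touched):
            other.committable = False
    return True


t1, t2, t3 = Txn("t1"), Txn("t2"), Txn("t3")
key_a = ("a", "balance")
key_c = ("c", "balance")

# t1 and t2 both read and write the balance of a; t3 touches only c.
for txn, key in ((t1, key_a), (t2, key_a), (t3, key_c)):
    txn.read_set.add(key)
    txn.write_set.add(key)

active = [t1, t2, t3]
results = [commit(txn, active) for txn in (t1, t2, t3)]
```

<p>In this run <code>t1</code> commits first, which makes <code>t2</code> un-committable because their sets intersect, while <code>t3</code> is unaffected and commits.</p>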

<p>To implement <code>SERIALIZABLE</code> isolation, i.e., the guarantee that if a transaction twice performs a <code>COUNT</code> the result will be the same, one also locks the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order.  The same thing may be done in an optimistic setting.</p>

<p>
  <a href="http://event.cwi.nl/SIGMOD-RWE/2010/22-7f15a1/paper.pdf" id="link-id0x1d5de810">Positional Handling of Updates in Column Stores</a> [Heman, Zukowski, <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x1df9c990">CWI</a> science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported.  There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough. </p>

<p>The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models.  We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme.  The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.</p>

<p>Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes.  A simple example is locking in key order when doing an operation on a set of values.  A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first.  We note that this latter trick also benefits optimistic schemes.</p>

<p>In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1f51d5d0">optimization</a>.</p>
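<p>As a rough illustration of ordering updates by predicate cardinality, the sketch below sorts a batch of triples so that those with the most frequent predicate come first, and in key order within a predicate. The predicate names, the counts in <code>stats</code>, and the function <code>order_updates</code> are all made up for the example; a real system would take the counts from its optimizer statistics.</p>

```python
def order_updates(updates, predicate_cardinality):
    """Order triples: highest-cardinality predicate first, then key order."""
    return sorted(
        updates,
        key=lambda t: (-predicate_cardinality.get(t[1], 0), t),
    )


# Hypothetical statistics: number of triples per predicate.
stats = {"stock_level": 100000, "district_next_id": 10}

updates = [
    ("d1", "district_next_id", 7),
    ("s9", "stock_level", 40),
    ("s2", "stock_level", 12),
]
ordered = order_updates(updates, stats)
```

<p>Here the two <code>stock_level</code> triples are placed before the <code>district_next_id</code> triple, and among themselves they follow subject order.</p>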

<h2>Eventual Consistency </h2>

<p>Web scale systems that need to maintain consistent state across multiple data centers sometimes use &quot;eventual consistency&quot; schemes.  <a class="auto-href" href="http://dbpedia.org/resource/Two-phase_commit_protocol" id="link-id0x1e3ba5d8">Two-phase-commit</a> becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.</p>

<p>Eventual consistency schemes (<a href="http://dbpedia.org/page/Dynamo_(storage_system)" id="link-id0x1f9db8f8">Amazon Dynamo</a>, <a href="http://research.yahoo.com/project/212" id="link-id0x1da3db80">Yahoo! PNUTS</a>) maintain history <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x8bf48e8">information</a> on the record which is the unit of concurrency control.  The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application&#39;s viewpoint.  Application logic can then be applied to reconciling differing copies of the same logical record. </p>

<p>Such a scheme seems <i>a priori</i> ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad.  We first note that only recently changed quads (i.e., <code>DELETEd + INSERTed</code>, as there is no <code>UPDATE</code>-in-place) need history information.  This history information can be stored away from the quad itself, thus not disrupting compression.  When detecting that one site has <code>INSERTed</code> a quad that another has <code>DELETEd</code> in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged.  The same can apply to conflicting values of properties that for the application should be single-valued.  Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.</p>
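<p>Detecting such conflicts from time-stamped logs could look roughly like the following. The log format (timestamp, operation, quad), the <code>window</code> parameter, and the function <code>find_conflicts</code> are assumptions made for this sketch; how the conflict is then resolved is left to application logic, as described above.</p>

```python
def find_conflicts(log_a, log_b, window):
    """Return quads INSERTed at one site and DELETEd at the other
    within `window` time units of each other."""
    conflicts = set()
    for ts_a, op_a, quad_a in log_a:
        for ts_b, op_b, quad_b in log_b:
            same_quad = quad_a == quad_b
            opposite_ops = {op_a, op_b} == {"INSERT", "DELETE"}
            close_in_time = window >= abs(ts_a - ts_b)
            if same_quad and opposite_ops and close_in_time:
                conflicts.add(quad_a)
    return sorted(conflicts)


# Hypothetical logs from two sites; quads are (S, P, O, G) numbers.
site_1 = [(100, "INSERT", (1, 2, 3, 0)), (105, "DELETE", (1, 5, 6, 0))]
site_2 = [(102, "DELETE", (1, 2, 3, 0)), (900, "INSERT", (1, 5, 6, 0))]

conflicts = find_conflicts(site_1, site_2, window=10)
```

<p>Only the first quad is reported here, since the opposing operations on the second quad are far apart in time. The nested loop is quadratic and is meant only to show the comparison, not an efficient log merge.</p>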

<p>As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema.  Key-value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels. </p>


<h2>Conclusions </h2>

<p>We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be <code>REPEATABLE READ</code> becomes <code>SERIALIZABLE</code>; and row-level locking becomes locking at the level of a single attribute value.  For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.</p>

<p>Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems.  We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.</p>

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1689">
  <rss:title>RDF and Transactions</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:52:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists. Transactions are certainly not the first thing that comes to mind when one hears &quot;RDF&quot;. We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says &quot;deadlock&quot;. &quot;Deadlock&quot; is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF? There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases. At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID. If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. 
Finally, there is (C) consistency, which means that the transaction&#39;s result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data. There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like). If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS. There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side. Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice. Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. 
This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data. Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage. As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions. As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less. This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data. As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry. Web developers especially do not like this, because this is not what MySQL has taught them to expect. 
MySQL does have transactional back-ends like InnoDB, but often gets used without transactions. With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable. If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, if these are under a certain size (normally 10000 quads) are guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there always was a possibility of having an INSERT or DELETE fail because of deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the INSERT or DELETE of a small graph. Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations. In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will here talk about <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x235282b8">RDF</a> and transactions for developers in general. The next one talks about specifics and is for specialists.</p>

<p>Transactions are certainly not the first thing that comes to mind when one hears &quot;RDF&quot;.  We have at times used a recruitment questionnaire where we ask applicants to define a transaction.  Many vaguely remember that it is a unit of work, but usually not more than that.  We sometimes get questions from users about why they get an error message that says &quot;deadlock&quot;.  &quot;Deadlock&quot; is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order.  What does this have to do with RDF?</p>

<p>There are in fact users who even use XA with a <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x235e5938">Virtuoso</a>-based RDF application.  <a class="auto-href" href="http://semanticweb.org/id/Franz_Inc" id="link-id0x28c09308">Franz</a> also has publicized their development of full <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x2365f710">ACID</a> capabilities for <a class="auto-href" href="http://semanticweb.org/id/AllegroGraph" id="link-id0x22caecb0">AllegroGraph</a>.  RDF is a database <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x235f1f70">schema</a> model, and transactions will inevitably become an issue in databases.</p>

<p>At the same time, the developer population trained with <a class="auto-href" href="http://dbpedia.org/resource/MySQL" id="link-id0x240f6a90">MySQL</a> and <a class="auto-href" href="http://dbpedia.org/resource/PHP" id="link-id0x238cd088">PHP</a> is not particularly transaction-aware.  Transactions have gone out of style, declares the No-<a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x232d9068">SQL</a> crowd.  Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post.  The <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x238c70a0">SPARQL</a> language and protocol do not go into transactions, except for expressing the wish that an <code>UPDATE</code> request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.  </p>

<p>If one says that a thing will either happen <i>in its entirety</i> or <i>not at all,</i> which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x23eadf50">data</a> at the same time?  Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction&#39;s result must not contradict restrictions the database is supposed to enforce.  RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.</p>

<p>There are, of course, database-like consistency criteria that one can express in RDF Schema and <a class="auto-href" href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x287b18e8">OWL</a>, concerning data types, mandatory presence of properties, or restrictions on cardinality (e.g., one may only have one spouse at a time, and the like).  </p>

<p>If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x2450b488">RDBMS</a> performance.  For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.</p>

<p>There is of course the OWL side, where consistency is important but is defined in such complex ways that it again is not a good fit for RDBMS.  RDF could be seen to be split between the schema-last world and the <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x2324ac40">knowledge</a> representation world.  I will here focus on the schema-last side.</p>

<p>Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions.  The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.</p>
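<p>The seat example can be sketched as follows.  For the sake of a self-contained, runnable illustration this uses Python with an in-memory SQLite table rather than an RDF store; the table layout and the <code>reserve_seat</code> function are made up for this example.  The point is only that the check and the reservation happen inside one transaction.</p>

```python
import sqlite3

def reserve_seat(conn, seat, customer):
    """Check and reserve `seat` inside one transaction; return True on success."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT holder FROM seats WHERE seat = ?", (seat,)
        ).fetchone()
        if row is None or row[0] is not None:
            # Unknown seat or already taken: give up without changing anything.
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "UPDATE seats SET holder = ? WHERE seat = ?", (customer, seat)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None puts the connection in autocommit mode,
# so the explicit BEGIN/COMMIT above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, holder TEXT)")
conn.execute("INSERT INTO seats VALUES ('12A', NULL)")
```

<p>With this setup, the first call to <code>reserve_seat(conn, &quot;12A&quot;, &quot;alice&quot;)</code> succeeds and a second call for the same seat is refused, since the check and the update cannot be interleaved with another writer.</p>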

<p>Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away.  This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.</p>

<p>Analytics workloads are not primarily about transactions, but still need to specify what happens with updates.  Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.</p>


<p>As mentioned before, the <a class="auto-href" href="http://lod2.eu/" id="link-id0x28ac0250">LOD2</a> project is at the crossroads of RDF and database.  I construe its mission to be the making of RDF into a respectable database discipline.  Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.</p>

<p>As previously argued, we need well-defined and auditable benchmarks.  This again brings up the topic of transactions.  Once we embark on the database benchmark route, there is no way around this. <a class="auto-href" href="http://www.tpc.org/" id="link-id0x284d2d80">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x280dcd40">H</a> mandates that the system under test support transactions, and the audit involves a test for this.  We can do no less.</p>

<p>This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.  </p>

<p>As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., <a class="auto-href" href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x235cecf0">ODBC</a>, <a class="auto-href" href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x27c4a0c0">JDBC</a>, or the <a class="auto-href" href="http://jena.sourceforge.net/" id="link-id0x283a89a8">Jena</a> or <a class="auto-href" href="http://sourceforge.net/projects/sesame/" id="link-id0x284b3490">Sesame</a> frameworks), and setting the isolation options on the connection.  Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.</p>
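<p>On the client side, dealing with deadlocks usually amounts to a retry loop.  Below is a minimal Python sketch of such a loop.  <code>DeadlockError</code> is a hypothetical stand-in for whatever error the client library in use reports when a transaction is aborted by deadlock, and the attempt count and delays are arbitrary; none of this is taken from the Virtuoso client APIs.</p>

```python
import random
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for a driver-specific 'aborted by deadlock' error."""

def run_with_retry(transaction, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call `transaction()` until it succeeds or has deadlocked `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Randomized, growing pause so that the competing
            # transactions do not collide again in lockstep.
            sleep(base_delay * attempt * (1 + random.random()))
```

<p>The <code>transaction</code> callable must redo all of its reads and writes on each attempt, since the aborted attempt has been rolled back in full.</p>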

<p>Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.</p>

<p>With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF.  It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.</p>

<p>If all users lock the resources they need in the same order, there will be no deadlocks.  This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent <code>INSERTs</code> and <code>DELETEs</code>, provided each is under a certain size (normally 10000 quads), is guaranteed never to fail due to locking.  These could still fail due to running out of space, though. With previous versions, there always was a possibility of having an <code>INSERT</code> or <code>DELETE</code> fail because of deadlock with multiple users.  Vectored <code>INSERT</code> and <code>DELETE</code> are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the <code>INSERT</code> or <code>DELETE</code> of a small graph. </p>
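<p>The principle of locking in a fixed order can be shown with a small Python sketch using threads.  The lock table and key names are invented for the illustration and say nothing about Virtuoso internals.  The two threads ask for the same keys in opposite orders, which would be the classic recipe for a deadlock, but since every request is sorted before any lock is taken, both always acquire <code>a</code> before <code>b</code>.</p>

```python
import threading

# Hypothetical lock table: one lock per key.
locks = {key: threading.Lock() for key in ("a", "b", "c")}

def update(keys, action):
    """Acquire the locks for `keys` in sorted order, run `action`, release in reverse."""
    ordered = sorted(set(keys))
    for key in ordered:
        locks[key].acquire()
    try:
        return action()
    finally:
        for key in reversed(ordered):
            locks[key].release()

counter = {"value": 0}

def bump():
    counter["value"] += 1

def worker(keys, rounds):
    for _ in range(rounds):
        update(keys, bump)

# The threads name the same keys in opposite orders.
threads = [
    threading.Thread(target=worker, args=(["a", "b"], 1000)),
    threading.Thread(target=worker, args=(["b", "a"], 1000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])  # prints 2000
```

<p>Without the <code>sorted</code> call, each thread could end up holding one lock while waiting for the other, and the program could hang.</p>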

<p>Furthermore, since the <a class="auto-href" href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x22ca4300">SPARQL protocol</a> has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself.  If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking.  We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock.  Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.</p>

<p>In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1684">
  <rss:title>Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:32:28Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This article covers the changes we have made to the BSBM test driver during our series of experiments. Drill-down mode - For queries that have a product type as parameter, the test driver will invoke the query multiple times with each time a random subtype of the product type of the previous invocation. The starting point of the drill-down is a random type from a settable level in the hierarchy. The rationale for the drill-down mode is that depending on the parameter choice, there can be 1000x differences in query run time. Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy. Permutation of query mix - In the BI workload, the queries are run in a random order on each thread in multiuser mode. Doing exactly the same thing on many threads is not realistic for large queries. The data access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities. For queries with a drill-down, the individual executions that make up the drill-down are still consecutive. New metrics - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the TPC-H Power and Throughput metrics. The Power is defined as (scale_factor / 284826) * 3600 / ((t1 * t2 * ... * tn) ^ (1 / n)) The Throughput is defined as (scale_factor / 284826) * 3600 / ((t1 + t2 + ... + tn) / n) The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). 
We consider this &quot;scale one.&quot; The reason for the multiplication is that scores at different scales should get similar numbers, otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries. We also show the percentage each query represents from the total time the test driver waits for responses. Deadlock retry - When running update mixes, it is possible that a transaction gets aborted by a deadlock. We have made a retry logic for this. Cluster mode - Cluster databases may have multiple interchangeable HTTP listeners. With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these. Identifying matter - A version number was added to test driver output. Use of the new switches is also indicated in the test driver output. SUT CPU - In comparing results it is crucial to differentiate between in memory runs and IO bound runs. To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups). A pluggable self-script determines the CPU times for the system; thus clusters can be handled, too. The time is given as a sum of the time the server processes have aged during the run and as a percentage over the wall-clock time. These changes will soon be available as a diff and as a source tree. This version is labeled BSBM Test Driver 1.1-opl; the -opl signifies OpenLink additions. We invite FU Berlin to include these enhancements into their Source Forge repository of the BSBM test driver. There is more precise documentation of these options in the README file in the above distribution. The next planned upgrade of the test driver concerns adding support for &quot;RDF-H&quot;, the RDF adaptation of the industry standard TPC-H decision support benchmark for RDBMS. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): The Substance of Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements (this post)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This article covers the changes we have made to the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x283a2528">BSBM</a> test driver during our series of experiments.</p>

<ul>
 <li>
  <p>
    <b>Drill-down mode</b> - For queries that have a product type as parameter, the test driver will invoke the query multiple times, each time with a random subtype of the product type used in the previous invocation. The starting point of the drill-down is a random type from a settable level in the hierarchy.  The rationale for the drill-down mode is that depending on the parameter choice, there can be 1000x differences in query run time.  Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy.</p>
 </li>

<li>
  <b>Permutation of query mix</b> - In the BI workload, the queries are run in a random order on each thread in multiuser mode.  Doing exactly the same thing on many threads is not realistic for large queries. The <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x23880860">data</a> access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities.  For queries with a drill-down, the individual executions that make up the drill-down are still consecutive.</li>

<li>
  <p>
    <b>New metrics</b> - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the <a class="auto-href" href="http://www.tpc.org/" id="link-id0x28ccd3f8">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x29ad25c8">H</a> Power and Throughput metrics. </p>
<p>The <i>Power</i> is defined as</p> 
<blockquote>(scale_factor / 284826) * 3600 / ((t1 * t2 * ... * tn) ^ (1 / n))</blockquote>
<p>The <i>Throughput</i> is defined as</p> 
<blockquote>(scale_factor / 284826) * 3600 / ((t1 + t2 + ... + tn) / n)</blockquote>
<p>The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt).  We consider this &quot;scale one.&quot;  The reason for the multiplication is that scores at different scales should get similar numbers; otherwise a 10x larger scale would result in roughly 10x lower throughput with the BI queries.</p>

<p>We also show the percentage each query represents from the total time the test driver waits for responses. </p>
</li>

<li>
  <p>
    <b>Deadlock retry</b> - When running update mixes, it is possible that a transaction gets aborted by a deadlock.  We have added retry logic for this.</p>
</li>

<li>
  <p>
    <b>Cluster mode</b> - Cluster databases may have multiple interchangeable <a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x236532c8">HTTP</a> listeners.  With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these.</p>
</li>

<li>
  <p>
    <b>Identifying matter</b> - A version number was added to test driver output.  Use of the new switches is also indicated in the test driver output.</p>
</li>

<li>
  <p>
    <b>SUT <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x249c8f68">CPU</a></b> - In comparing results it is crucial to differentiate between in-memory runs and I/O-bound runs.  To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups).  A pluggable self-script determines the CPU times for the system; thus clusters can be handled, too.  The time is given as the sum of the CPU time the server processes have accumulated during the run and as a percentage over the wall-clock time.</p>
</li>
</ul>
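<p>The Power and Throughput definitions from the <b>New metrics</b> item can be written out as a short Python sketch.  The function names are ours for the sake of the example and are not taken from the test driver source; run times are in seconds, one per query.</p>

```python
import math

UNIT_SCALE = 284826  # BSBM scale factor that generates roughly 100 Mt

def bi_power(scale_factor, times):
    """Geometric mean of the run times, as queries per hour, scaled to 100 Mt."""
    geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
    return (scale_factor / UNIT_SCALE) * 3600 / geo_mean

def bi_throughput(scale_factor, times):
    """Arithmetic mean of the run times, as queries per hour, scaled to 100 Mt."""
    arith_mean = sum(times) / len(times)
    return (scale_factor / UNIT_SCALE) * 3600 / arith_mean
```

<p>For example, two queries taking 1 s and 4 s at scale 284826 have a geometric mean of 2 s and an arithmetic mean of 2.5 s, giving a Power of 1800 and a Throughput of 1440.  The geometric mean is computed through logarithms so that the product of many run times cannot overflow.</p>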

<p>These changes will soon be available <a href="http://blogs.usnet.private:8893/RPC2" id="link-id0x1f9a57c0">as a diff</a> and <a href="http://blogs.usnet.private:8893/RPC2" id="link-id0x1f2fea08">as a source tree</a>. This version is labeled <b><code>BSBM Test Driver 1.1-opl</code></b>; the <b><code>-opl</code></b> signifies OpenLink additions.  </p>

<p>We invite FU Berlin to include these enhancements in their SourceForge repository of the BSBM test driver.  There is more precise documentation of these options in the README file in the above distribution.</p>

<p>The next planned upgrade of the test driver concerns adding support for &quot;<a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x23de9eb8">RDF</a>-H&quot;, the RDF adaptation of the industry standard TPC-H decision support benchmark for <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x22cca4e0">RDBMS</a>.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1db2be00">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1dfcc038">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x197c26d0">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d149cf0">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ab69450">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1e67d688">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1dad87c8">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1cc73830">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1d6879a8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dfae510">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1ef052a0">Benchmarks, Redux (part 11): The Substance of Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1dadddb0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e662ef0">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1df6fa70">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements <i>(this post)</i>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1683">
  <rss:title>Benchmarks, Redux (part 14): BSBM BI Mix</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:31:32Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post. Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted. Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit. Below we show samples of test driver output; the whole output is downloadable. 
100 Mt Single User bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill \ -ucf bsbm/usecases/businessIntelligence/sparql.txt \ -dg http://bsbm.org http://localhost:8604/sparql 0: 43348.14ms, total: 43440ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 43.3481s / 43.3481s Elapsed runtime: 43.348 seconds QMpH: 83.049 query mixes per hour CQET: 43.348 seconds average runtime of query mix CQET (geom.): 43.348 seconds geometric mean runtime of query mix AQET (geom.): 0.492 seconds geometric mean runtime of query Throughput: 1494.874 BSBM-BI throughput: qph*scale BI Power: 7309.820 BSBM-BI Power: qph*scale (geom) 100 Mt 8 User Thread 6: query mix 3: 195793.09ms, total: 196086.18ms Thread 8: query mix 0: 197843.84ms, total: 198010.50ms Thread 7: query mix 4: 201806.28ms, total: 201996.26ms Thread 2: query mix 5: 221983.93ms, total: 222105.96ms Thread 4: query mix 7: 225127.55ms, total: 225317.49ms Thread 3: query mix 6: 225860.49ms, total: 226050.17ms Thread 5: query mix 2: 230884.93ms, total: 231067.61ms Thread 1: query mix 1: 237836.61ms, total: 237959.11ms Benchmark run completed in 237.985427s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 195.7931s / 237.8366s Total runtime (sum): 1737.137 seconds Elapsed runtime: 1737.137 seconds QMpH: 121.016 query mixes per hour CQET: 217.142 seconds average runtime of query mix CQET (geom.): 216.603 seconds geometric mean runtime of query mix AQET (geom.): 2.156 seconds geometric mean runtime of query Throughput: 2178.285 BSBM-BI throughput: qph*scale BI Power: 1669.745 BSBM-BI Power: qph*scale (geom) 1000 Mt Single User 0: 608707.03ms, total: 608768ms Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on 
Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 608.7070s / 608.7070s Elapsed runtime: 608.707 seconds QMpH: 5.914 query mixes per hour CQET: 608.707 seconds average runtime of query mix CQET (geom.): 608.707 seconds geometric mean runtime of query mix AQET (geom.): 5.167 seconds geometric mean runtime of query Throughput: 1064.552 BSBM-BI throughput: qph*scale BI Power: 6967.325 BSBM-BI Power: qph*scale (geom) 1000 Mt 8 User bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill \ -ucf bsbm/usecases/businessIntelligence/sparql.txt \ -dg http://bsbm.org http://localhost:8604/sparql Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms Benchmark run completed in 2889.302566s Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 2211.2753s / 2889.1992s Total runtime (sum): 20481.895 seconds Elapsed runtime: 20481.895 seconds QMpH: 9.968 query mixes per hour CQET: 2560.237 seconds average runtime of query mix CQET (geom.): 2544.284 seconds geometric mean runtime of query mix AQET (geom.): 13.556 seconds geometric mean runtime of query Throughput: 1794.205 BSBM-BI throughput: qph*scale BI Power: 2655.678 BSBM-BI Power: qph*scale (geom) Metrics for Query: 1 Count: 8 times executed in whole run Time share 2.120884% of total execution time AQET: 54.299656 seconds (arithmetic mean) AQET(geom.): 34.607302 seconds (geometric mean) QPS: 0.13 Queries per 
second minQET/maxQET: 11.71547600s / 148.65379700s Metrics for Query: 2 Count: 8 times executed in whole run Time share 0.207382% of total execution time AQET: 5.309462 seconds (arithmetic mean) AQET(geom.): 2.737696 seconds (geometric mean) QPS: 1.34 Queries per second minQET/maxQET: 0.78729800s / 25.80948200s Metrics for Query: 3 Count: 8 times executed in whole run Time share 17.650472% of total execution time AQET: 451.893890 seconds (arithmetic mean) AQET(geom.): 410.481088 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 171.07262500s / 721.72939200s Metrics for Query: 5 Count: 32 times executed in whole run Time share 6.196565% of total execution time AQET: 39.661685 seconds (arithmetic mean) AQET(geom.): 6.849882 seconds (geometric mean) QPS: 0.18 Queries per second minQET/maxQET: 0.15696500s / 189.00906200s Metrics for Query: 6 Count: 8 times executed in whole run Time share 0.119916% of total execution time AQET: 3.070136 seconds (arithmetic mean) AQET(geom.): 2.056059 seconds (geometric mean) QPS: 2.31 Queries per second minQET/maxQET: 0.41524400s / 7.55655300s Metrics for Query: 7 Count: 40 times executed in whole run Time share 1.577963% of total execution time AQET: 8.079921 seconds (arithmetic mean) AQET(geom.): 1.342079 seconds (geometric mean) QPS: 0.88 Queries per second minQET/maxQET: 0.02205800s / 40.27761500s Metrics for Query: 8 Count: 40 times executed in whole run Time share 72.126818% of total execution time AQET: 369.323481 seconds (arithmetic mean) AQET(geom.): 114.431863 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 5.94377300s / 1824.57867400s The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. 
The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization. With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit. If the single-user run was at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good. For reference, we show a run with Virtuoso 6 at 100Mt. 
0: 424754.40ms, total: 424829ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 424.7544s / 424.7544s Elapsed runtime: 424.754 seconds QMpH: 8.475 query mixes per hour CQET: 424.754 seconds average runtime of query mix CQET (geom.): 424.754 seconds geometric mean runtime of query mix AQET (geom.): 1.097 seconds geometric mean runtime of query Throughput: 152.559 BSBM-BI throughput: qph*scale BI Power: 3281.150 BSBM-BI Power: qph*scale (geom) and 8 user Thread 5: query mix 3: 616997.86ms, total: 617042.83ms Thread 7: query mix 4: 625522.18ms, total: 625559.09ms Thread 3: query mix 7: 626247.62ms, total: 626304.96ms Thread 1: query mix 0: 629675.17ms, total: 629724.98ms Thread 4: query mix 6: 667633.36ms, total: 667670.07ms Thread 8: query mix 2: 674206.07ms, total: 674256.72ms Thread 6: query mix 5: 695020.21ms, total: 695052.29ms Thread 2: query mix 1: 701824.67ms, total: 701864.91ms Benchmark run completed in 701.909341s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 616.9979s / 701.8247s Total runtime (sum): 5237.127 seconds Elapsed runtime: 5237.127 seconds QMpH: 41.031 query mixes per hour CQET: 654.641 seconds average runtime of query mix CQET (geom.): 653.873 seconds geometric mean runtime of query mix AQET (geom.): 2.557 seconds geometric mean runtime of query Throughput: 738.557 BSBM-BI throughput: qph*scale BI Power: 1408.133 BSBM-BI Power: qph*scale (geom) Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput. 100 Mt: sqrt ( 7771 * 2178 ); = 4114 1000 Mt: sqrt ( 6967 * 1794 ); = 3535 Scaling seems to work; the results are in the same general ballpark. 
The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable. Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example. Also there should be more queries. At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people. Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an IN into a JOIN with the IN subquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN which besides must be partitioned at the larger scales. Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. 
Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point. In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended. BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonable JOIN order since we always reorder the query regardless of the source formulation. BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though. There are places in BSBM-BI where using a HASH JOIN makes sense. We do not use HASH JOINs with RDF because there is an index for everything and making a HASH JOIN in the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like &quot;count the hairballs made in Germany reviewed by Japanese Hello Kitty fans,&quot; where two ends of a JOIN path are fairly selective doing the other as a HASH JOIN is good. This can, if the JOIN is always cardinality-reducing, even be merged inside an INDEX lookup. 
We have such capabilities since we have been for a while gearing up for the relational races, but are not using any of these with BSBM-BI, although they would be useful. Let us see the profile for a single user 100 Mt run. The database activity summary is -- select db_activity (0, &#39;http&#39;); 161.3M rnd 210.2M seq 0 same seg 104.5M same pg 45.08M same par 0 disk 0 spec disk 0B / 0 messages 2.393K fork See the post &quot;What Does BSBM Explore Measure&quot; for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get -- 172.4M rnd 220.8M seq 0 same seg 149.6M same pg 10.99M same par 21 disk 861 spec disk 0B / 0 messages 754 fork The throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains. Let us now go back to the first run with vector size 10000. 
The top of the CPU oprofile is as follows: 722309 15.4507 cmpf_iri64n_iri64n 434791 9.3005 cmpf_iri64n_iri64n_anyn_iri64n 294712 6.3041 itc_next_set 273488 5.8501 itc_vec_split_search 203970 4.3631 itc_dive_transit 199687 4.2714 itc_page_rcf_search 181614 3.8848 dc_itc_append_any 173043 3.7015 itc_bm_vec_row_check 146727 3.1386 cmpf_int64n 128224 2.7428 itc_vec_row_check 113515 2.4282 dk_alloc 97296 2.0812 page_wait_access 62523 1.3374 qst_vec_get_int64 59014 1.2623 itc_next_set_parent 53589 1.1463 sslr_qst_get 48003 1.0268 ds_add 46641 0.9977 dk_free_tree 44551 0.9530 kc_var_col 43650 0.9337 page_col_cmp_1 35297 0.7550 cmpf_iri64n_iri64n_anyn_gt_lt 34589 0.7399 dv_compare 25864 0.5532 cmpf_iri64n_anyn_iri64n_iri64n_lte 23088 0.4939 dk_free The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with P and S given. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns. A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set. Some code sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good. TPC-H has such in Q21 and Q19. An OR with existences where there would be gain from good guesses of a subquery&#39;s selectivity would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like the CASE-WHEN in TPC-H Q12. 
We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more. This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): The Substance of Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM-BI Modifications Benchmarks, Redux (part 14): BSBM-BI Mix (this post) Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In this post, we look at how we run the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x23be8d28">BSBM</a>-BI mix.  We consider the 100 Mt and 1000 Mt scales with <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x23b69e40">Virtuoso</a> 7 using the same hardware and software as in the previous posts.  The changes to workload and metric are given in the previous post.</p>

<p>Our intent here is to look at whether the metric works, and to see what results will look like in general.  We are as much testing the benchmark as we are testing the system-under-test (SUT).  The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance.  Anyway, for the sake of disclosure, we attach the query templates.  The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted.</p>


<p>Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the <a class="auto-href" href="http://lod2.eu/" id="link-id0x23a74c40">LOD2</a> service we plan for this (see previous posts in this series).  This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.</p>

<p>Below we show samples of test driver output; the whole output is <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/br.tar.gz" id="link-id0x1b703ad8">downloadable</a>.</p>

<p>100 Mt Single User</p>

<blockquote>
 <code><pre>
bsbm/testdriver   -runs 1   -w 0 -idir /bs/1  -drill  \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt  \
   -dg http://bsbm.org http://localhost:8604/sparql
</pre>
 </code>
</blockquote>

<blockquote>
 <code><pre>
0: 43348.14ms, total: 43440ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    43.3481s / 43.3481s
Elapsed runtime:        43.348 seconds
QMpH:                   83.049 query mixes per hour
CQET:                   43.348 seconds average runtime of query mix
CQET (geom.):           43.348 seconds geometric mean runtime of query mix
AQET (geom.):           0.492 seconds geometric mean runtime of query
Throughput:             1494.874 BSBM-BI throughput: qph*scale
BI Power:               7309.820 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>
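
<p>The summary figures can be cross-checked from the raw numbers. The following Python sketch is not the test driver's code. It assumes that QMpH is the number of mixes per elapsed hour, that a query mix holds 18 queries (which is what the per-query counts in the 1000 Mt 8 User output add up to), and that "scale" is the scale factor divided by the 100 Mt scale factor of 284826. The function and constant names are made up for illustration.</p>

<blockquote>
 <code><pre>
# Hedged reconstruction of the driver's summary metrics; names are illustrative.
BASE_SCALE_FACTOR = 284826   # scale factor of the 100 Mt dataset
QUERIES_PER_MIX = 18         # assumption: inferred from the per-query counts


def qmph(mixes, elapsed_seconds):
    """Query mixes completed per hour of wall-clock time."""
    return mixes * 3600.0 / elapsed_seconds


def throughput(mixes, elapsed_seconds, scale_factor):
    """Queries per hour, multiplied by the scale relative to 100 Mt."""
    scale = scale_factor / BASE_SCALE_FACTOR
    return qmph(mixes, elapsed_seconds) * QUERIES_PER_MIX * scale


def bi_power(aqet_geom_seconds, scale_factor):
    """Queries per hour from the geometric mean query time, times scale."""
    scale = scale_factor / BASE_SCALE_FACTOR
    return 3600.0 / aqet_geom_seconds * scale


# 100 Mt single user: 1 mix in 43.348 s, AQET (geom.) 0.492 s
print(round(qmph(1, 43.348), 3))
print(round(throughput(1, 43.348, 284826), 1))
print(round(bi_power(0.492, 284826)))
</pre>
 </code>
</blockquote>

<p>Under these assumptions the results land close to the 83.049, 1494.874, and 7309.820 reported above; the small differences come from feeding in the rounded values that the driver prints.</p>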



<p>100 Mt 8 User </p>

<blockquote>
 <code><pre>
Thread 6: query mix 3: 195793.09ms, total: 196086.18ms
Thread 8: query mix 0: 197843.84ms, total: 198010.50ms
Thread 7: query mix 4: 201806.28ms, total: 201996.26ms
Thread 2: query mix 5: 221983.93ms, total: 222105.96ms
Thread 4: query mix 7: 225127.55ms, total: 225317.49ms
Thread 3: query mix 6: 225860.49ms, total: 226050.17ms
Thread 5: query mix 2: 230884.93ms, total: 231067.61ms
Thread 1: query mix 1: 237836.61ms, total: 237959.11ms
Benchmark run completed in 237.985427s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    195.7931s / 237.8366s
Total runtime (sum):    1737.137 seconds
Elapsed runtime:        1737.137 seconds
QMpH:                   121.016 query mixes per hour
CQET:                   217.142 seconds average runtime of query mix
CQET (geom.):           216.603 seconds geometric mean runtime of query mix
AQET (geom.):           2.156 seconds geometric mean runtime of query
Throughput:             2178.285 BSBM-BI throughput: qph*scale
BI Power:               1669.745 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>


<p>1000 Mt Single User</p>

<blockquote>
 <code><pre>
0: 608707.03ms, total: 608768ms

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    608.7070s / 608.7070s
Elapsed runtime:        608.707 seconds
QMpH:                   5.914 query mixes per hour
CQET:                   608.707 seconds average runtime of query mix
CQET (geom.):           608.707 seconds geometric mean runtime of query mix
AQET (geom.):           5.167 seconds geometric mean runtime of query
Throughput:             1064.552 BSBM-BI throughput: qph*scale
BI Power:               6967.325 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>


<p>1000 Mt 8 User </p>

<blockquote>
 <code><pre>
bsbm/testdriver   -runs 8 -mt 8  -w 0 -idir /bs/10  -drill  \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt   \
   -dg http://bsbm.org http://localhost:8604/sparql
</pre>
 </code>
</blockquote>

<blockquote>
 <code><pre>
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    2211.2753s / 2889.1992s
Total runtime (sum):    20481.895 seconds
Elapsed runtime:        20481.895 seconds
QMpH:                   9.968 query mixes per hour
CQET:                   2560.237 seconds average runtime of query mix
CQET (geom.):           2544.284 seconds geometric mean runtime of query mix
AQET (geom.):           13.556 seconds geometric mean runtime of query
Throughput:             1794.205 BSBM-BI throughput: qph*scale
BI Power:               2655.678 BSBM-BI Power: qph*scale (geom)

Metrics for Query:      1
Count:                  8 times executed in whole run
Time share              2.120884% of total execution time
AQET:                   54.299656 seconds (arithmetic mean)
AQET(geom.):            34.607302 seconds (geometric mean)
QPS:                    0.13 Queries per second
minQET/maxQET:          11.71547600s / 148.65379700s

Metrics for Query:      2
Count:                  8 times executed in whole run
Time share              0.207382% of total execution time
AQET:                   5.309462 seconds (arithmetic mean)
AQET(geom.):            2.737696 seconds (geometric mean)
QPS:                    1.34 Queries per second
minQET/maxQET:          0.78729800s / 25.80948200s

Metrics for Query:      3
Count:                  8 times executed in whole run
Time share              17.650472% of total execution time
AQET:                   451.893890 seconds (arithmetic mean)
AQET(geom.):            410.481088 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          171.07262500s / 721.72939200s

Metrics for Query:      5
Count:                  32 times executed in whole run
Time share              6.196565% of total execution time
AQET:                   39.661685 seconds (arithmetic mean)
AQET(geom.):            6.849882 seconds (geometric mean)
QPS:                    0.18 Queries per second
minQET/maxQET:          0.15696500s / 189.00906200s

Metrics for Query:      6
Count:                  8 times executed in whole run
Time share              0.119916% of total execution time
AQET:                   3.070136 seconds (arithmetic mean)
AQET(geom.):            2.056059 seconds (geometric mean)
QPS:                    2.31 Queries per second
minQET/maxQET:          0.41524400s / 7.55655300s

Metrics for Query:      7
Count:                  40 times executed in whole run
Time share              1.577963% of total execution time
AQET:                   8.079921 seconds (arithmetic mean)
AQET(geom.):            1.342079 seconds (geometric mean)
QPS:                    0.88 Queries per second
minQET/maxQET:          0.02205800s / 40.27761500s

Metrics for Query:      8
Count:                  40 times executed in whole run
Time share              72.126818% of total execution time
AQET:                   369.323481 seconds (arithmetic mean)
AQET(geom.):            114.431863 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          5.94377300s / 1824.57867400s
</pre>
 </code>
</blockquote>
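
<p>The per-query "Time share" figures appear consistent with the execution count times the arithmetic mean AQET, divided by the total runtime summed over all threads. The Python sketch below checks this under that assumption, with the numbers copied from the 1000 Mt 8 User output above; the names are illustrative and not taken from the driver.</p>

<blockquote>
 <code><pre>
# Assumption: time share = count * AQET / total runtime (sum over threads).
TOTAL_RUNTIME_S = 20481.895  # "Total runtime (sum)" of the 1000 Mt 8 User run

# query number -> (count, arithmetic mean AQET in seconds), from the output above
per_query = {
    1: (8, 54.299656),
    2: (8, 5.309462),
    3: (8, 451.893890),
    5: (32, 39.661685),
    6: (8, 3.070136),
    7: (40, 8.079921),
    8: (40, 369.323481),
}


def time_share_percent(count, aqet_seconds, total_seconds=TOTAL_RUNTIME_S):
    """Percentage of total execution time spent in one query template."""
    return 100.0 * count * aqet_seconds / total_seconds


shares = {q: time_share_percent(c, t) for q, (c, t) in per_query.items()}
for q, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"Q{q}: {share:.2f}%")

queries_per_mix = sum(c for c, _ in per_query.values()) / 8  # 8 mixes were run
print(queries_per_mix)
</pre>
 </code>
</blockquote>

<p>The shares computed this way add up to 100%, and the counts work out to 18 queries per mix, dominated by the five executions each of Q7 and Q8 and the four of Q5.</p>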



<p>The <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x249ce740">CPU</a> utilization for the multiuser runs stays above 1500% for the whole run. For the single-user 100 Mt run it is 630%; for the single-user 1000 Mt run, 574%. This can be improved since the queries usually have a lot of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x2871b1f0">data</a> to work on.  But final <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x22c95b90">optimization</a> is not our goal yet; we are just surveying the race track. The difference between a warm single-user run and a cold single-user run is about 15% with data on SSD; with data on hard disk, the difference would be larger.  The numbers shown are with warm <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x22ca4300">cache</a>.  The difference between single-user and multi-user Throughput at 1000 Mt, 1064 vs. 1794, is about what one would expect from the CPU utilization.</p>

<p>With these numbers, the CPU does not appear badly memory-bound, or else the increase would be smaller; core multi-threading also seems to bring some benefit.  If the single-user run were at 800%, the Throughput would be 1488.  The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact.  Core multi-threading does not seem to hurt, at the very least.  Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The <a class="auto-href" href="http://dbpedia.org/resource/Intel_Corporation" id="link-id0x28814950">Intel</a> Nehalem memory subsystem is really pretty good.</p>
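
<p>The 800% estimate is a straight linear extrapolation from the measured CPU utilization. A minimal Python sketch of that arithmetic, assuming throughput scales in proportion to CPU use (the function name is made up for illustration):</p>

<blockquote>
 <code><pre>
# Linear extrapolation of single-user Throughput to 8 fully busy cores.
# Assumption: throughput scales in proportion to CPU utilization.

def extrapolate(throughput, cpu_percent, target_percent=800.0):
    """Scale a measured throughput to a hypothetical CPU utilization."""
    return throughput * target_percent / cpu_percent


single_user = 1064.552   # 1000 Mt single-user Throughput
multi_user = 1794.205    # 1000 Mt 8 user Throughput

at_800 = extrapolate(single_user, 574.0)
print(round(at_800))                  # estimate at 800% CPU
print(round(multi_user / at_800, 2))  # ratio left over for core multi-threading
</pre>
 </code>
</blockquote>

<p>With the rounded inputs shown, the estimate lands within half a percent of the 1488 quoted above, and the multi-user Throughput exceeds it by roughly 20%.</p>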
<p>For reference, we show a single-user run with Virtuoso 6 at 100 Mt.</p>

<blockquote>
 <code><pre>
0: 424754.40ms, total: 424829ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    424.7544s / 424.7544s
Elapsed runtime:        424.754 seconds
QMpH:                   8.475 query mixes per hour
CQET:                   424.754 seconds average runtime of query mix
CQET (geom.):           424.754 seconds geometric mean runtime of query mix
AQET (geom.):           1.097 seconds geometric mean runtime of query
Throughput:             152.559 BSBM-BI throughput: qph*scale
BI Power:               3281.150 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>


<p>And the same with 8 users:</p>

<blockquote>
 <code><pre>
Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    616.9979s / 701.8247s
Total runtime (sum):    5237.127 seconds
Elapsed runtime:        5237.127 seconds
QMpH:                   41.031 query mixes per hour
CQET:                   654.641 seconds average runtime of query mix
CQET (geom.):           653.873 seconds geometric mean runtime of query mix
AQET (geom.):           2.557 seconds geometric mean runtime of query
Throughput:             738.557 BSBM-BI throughput: qph*scale
BI Power:               1408.133 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>




<p>Having the numbers, let us look at the metric and its scaling.  We take the geometric mean of the single-user Power and the multiuser Throughput.</p>


<blockquote>
 <code><pre>
 100 Mt: sqrt ( 7771 * 2178 ); = 4114

1000 Mt: sqrt ( 6967 * 1794 ); = 3535
</pre>
 </code>
</blockquote>
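<p>As a quick check of the arithmetic above, a minimal sketch in Python follows. The function name <code>composite</code> is ours and is not part of the test driver; the Power and Throughput figures are the ones quoted above.</p>

```python
import math

def composite(power, throughput):
    """Geometric mean of the single-user Power and the multi-user Throughput."""
    return math.sqrt(power * throughput)

# Power and Throughput figures quoted in the text for the two scales.
runs = {"100 Mt": (7771, 2178), "1000 Mt": (6967, 1794)}
composites = {scale: round(composite(p, t)) for scale, (p, t) in runs.items()}
for scale, value in composites.items():
    print(scale, value)
```

<p>Rounded to the nearest integer, this reproduces the 4114 and 3535 shown above.</p>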


<p>Scaling seems to work; the results are in the same general ballpark.  The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy.  If we made this start one level from the top, its share would drop.  This seems reasonable.</p>

<p>Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.</p>

<p>Also there should be more queries.</p>

<p>At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety.  We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin.  So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL.  On the other hand, BSBM-BI is not very good as a benchmark; <a class="auto-href" href="http://www.tpc.org/" id="link-id0x23227ce0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x279c6700">H</a> is a lot better.  This stands to reason, as TPC-H has had years and years of development and participation by many people.</p>

<p>Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an <code>IN</code> into a <code>JOIN</code> with the <code>IN</code> subquery in the outer loop and doing streaming aggregation.  Q13 cannot be done without a well-optimized <code><a class="auto-href" href="http://dbpedia.org/resource/Hash_join" id="link-id0x238cbf88">HASH JOIN</a></code> which, moreover, must be partitioned at the larger scales.</p>

<p>Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for.  Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.</p>


<p>In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.</p>

<p>BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a <code>JOIN</code> and the cardinality of a <code>GROUP BY</code>; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed.  I did however have to add some cardinality statistics to get reasonable <code>JOIN</code> order since we always reorder the query regardless of the source formulation.</p>

<p>BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different <code>JOIN</code> orders for different parameter values.  I have not looked into whether this really makes a difference, though.</p>

<p>There are places in BSBM-BI where using a <code>HASH JOIN</code> makes sense.  We do not use <code>HASH JOINs</code> with <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x235d8d88">RDF</a> because there is an index for everything and making a <code>HASH JOIN</code> in the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not do <code>HASH JOINs</code>.  This said, a <code>HASH JOIN</code> in the right place is a lot better than an index lookup.  With TPC-H Q13, our best <code>HASH JOIN</code> is over 2x better than the best <code>INDEX</code>-based <code>JOIN</code>, both being well tuned.  For questions like &quot;count the hairballs made in <a class="auto-href" href="http://dbpedia.org/resource/Germany" id="link-id0x2358ae60">Germany</a> reviewed by Japanese Hello Kitty fans,&quot; where both ends of a <code>JOIN</code> path are fairly selective, starting from one end and doing the other as a <code>HASH JOIN</code> is good.  This can, if the <code>JOIN</code> is always cardinality-reducing, even be merged inside an <code>INDEX</code> lookup.  We have such capabilities since we have been gearing up for the relational races for a while, but we are not using any of these with BSBM-BI, although they would be useful.</p>
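<p>For readers who want the build-and-probe idea spelled out, here is a minimal sketch in Python. It is an illustration only and not Virtuoso&#39;s implementation; the function <code>hash_join</code>, the row layout, and the sample data are all made up for the example.</p>

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two lists of dicts: hash one side, then probe with the other."""
    table = defaultdict(list)
    for row in build_rows:             # up-front cost: one pass over the build side
        table[row[build_key]].append(row)
    joined = []
    for row in probe_rows:             # one hash lookup per probe row, no index descent
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical data: the selective "made in Germany" end is the build side,
# the selective "reviewed by a given fan group" end is the probe side.
german_products = [{"product": 1}, {"product": 3}]
fan_reviews = [
    {"review": 10, "product": 1},
    {"review": 11, "product": 2},
    {"review": 12, "product": 3},
    {"review": 13, "product": 3},
]
count = len(hash_join(german_products, fan_reviews, "product", "product"))
print(count)
```

<p>The build pass is the up-front cost mentioned above: it pays off only if the probe side is large enough, which is why a misplaced <code>HASH JOIN</code> is costly.</p>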
 

<p>Let us see the profile for a single user 100 Mt run.</p>

<p>The database activity summary is --</p>

<p>
<code>select db_activity (0, &#39;http&#39;);</code>
</p>

<p>
<code>161.3M rnd  210.2M seq  0 same seg  104.5M same pg  45.08M same par  0 disk  0 spec disk  0B / 0 messages  2.393K fork</code>
</p>


<p>See the post &quot;<a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1b1f3068">What Does BSBM Explore Measure</a>&quot; for an explanation of the numbers.  We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality.  Running with a longer vector size would probably increase performance by getting better locality.  There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --</p>

<p>
<code>172.4M rnd  220.8M seq  0 same seg  149.6M same pg  10.99M same par  21 disk  861 spec disk  0B / 0 messages  754 fork</code>
</p>


<p>The throughput goes from 1494 to 1779.  We see more hits on the same page, as expected.  We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains.</p>
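<p>The locality shares behind these statements can be worked out from the two activity lines. The sketch below assumes, as the discussion above suggests, that the <code>same pg</code> and <code>same par</code> counters are subsets of the random lookups; the dictionary layout and the helper <code>share</code> are ours.</p>

```python
# Counters copied from the two db_activity lines, in millions of rows,
# keyed by the vector size used for the run.
runs = {
    10000: {"rnd": 161.3, "same_pg": 104.5, "same_par": 45.08},
    100000: {"rnd": 172.4, "same_pg": 149.6, "same_par": 10.99},
}

def share(counters, key):
    """Fraction of random lookups that fall under the given locality counter."""
    return counters[key] / counters["rnd"]

shares = {
    size: (round(share(c, "same_pg"), 2), round(share(c, "same_par"), 2))
    for size, c in runs.items()
}
gain = round(1779 / 1494 - 1, 2)   # relative throughput gain quoted in the text

for size, (same_pg, same_par) in shares.items():
    print(size, same_pg, same_par)
print(gain)
```

<p>Under that reading, the same-page share rises from roughly 65% to roughly 87% of random lookups, for a throughput gain of about 19%.</p>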

<p>Let us now go back to the first run with vector size 10000.</p>

<p>The top of the CPU <code>oprofile</code> is as follows:</p>

<blockquote>
 <code><pre>
722309   15.4507  cmpf_iri64n_iri64n
434791    9.3005  cmpf_iri64n_iri64n_anyn_iri64n
294712    6.3041  itc_next_set
273488    5.8501  itc_vec_split_search
203970    4.3631  itc_dive_transit
199687    4.2714  itc_page_rcf_search
181614    3.8848  dc_itc_append_any
173043    3.7015  itc_bm_vec_row_check
146727    3.1386  cmpf_int64n
128224    2.7428  itc_vec_row_check
113515    2.4282  dk_alloc
97296     2.0812  page_wait_access
62523     1.3374  qst_vec_get_int64
59014     1.2623  itc_next_set_parent
53589     1.1463  sslr_qst_get
48003     1.0268  ds_add
46641     0.9977  dk_free_tree
44551     0.9530  kc_var_col
43650     0.9337  page_col_cmp_1
35297     0.7550  cmpf_iri64n_iri64n_anyn_gt_lt
34589     0.7399  dv_compare
25864     0.5532  cmpf_iri64n_anyn_iri64n_iri64n_lte
23088     0.4939  dk_free
</pre>
 </code>
</blockquote>

<p>The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with <code>P</code> and <code>S</code> given.  The one after that is with all parts given, corresponding to an existence test.  The existence tests could probably be converted to <code>HASH JOIN</code> lookups to good advantage.  Aggregation and arithmetic are absent.  We should probably add a query like TPC-H Q1 that does nothing but these two.  Considering the overall profile, <code>GROUP BY</code> seems to be around 3%.  We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.</p>
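<p>To make the streaming aggregation point concrete, here is a minimal sketch in Python, assuming the input arrives already ordered by the grouping column. The function <code>streaming_count_sum</code> and the sample rows are hypothetical and do not reflect how the engine implements it.</p>

```python
from itertools import groupby
from operator import itemgetter

def streaming_count_sum(rows):
    """Yield (key, count, total) for (key, value) rows that arrive ordered by key."""
    for key, group in groupby(rows, key=itemgetter(0)):
        count = 0
        total = 0
        for _, value in group:
            count += 1
            total += value
        yield key, count, total    # emitted as soon as the key changes

# Hypothetical input, already grouped by the first column.
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
result = list(streaming_count_sum(rows))
print(result)
```

<p>Only one group is held in memory at a time, so the number of groups can be very large without building a hash table of all of them.</p>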

<p>A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set.  Some code sections in the queries with conditional execution and costly tests inside <code>ANDs</code> and <code>ORs</code> would be good.  TPC-H has such in Q21 and Q19.  An <code>OR</code> with existences where there would be gain from good guesses of a subquery&#39;s selectivity would be appropriate.  Also, there should be conditional expressions somewhere with a lot of data, like the <code>CASE-WHEN</code> in TPC-H Q12.</p>

<p>We can make BSBM-BI more interesting by putting in the above.  Also we will have to see where we can profit from <code>HASH JOIN</code>, both small and large.  There should be such places in the workload already so this is a matter of just playing a bit more.</p>

<p>This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1fd1d4e0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d5b07d8">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1dfe6c48">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x197fce30">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1fbf4210">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1beeb1e0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1d7e1818">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1dfc1730">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1ea819a8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1ec73da0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1fbdce90">Benchmarks, Redux (part 11): The Substance of Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x19928618">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f3d8710">Benchmarks, Redux (part 13): BSBM-BI Modifications </a>
</li>
<li>
Benchmarks, Redux (part 14): BSBM-BI Mix  <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e627400">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-22#1682">
  <rss:title>Benchmarks, Redux (part 13): BSBM BI Modifications</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:30:44Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post we introduce changes to the BSBM BI queries and metric. These changes are motivated by prevailing benchmark practice and by our experiences in optimizing for the BSBM BI workload. We will publish results according to the definitions given here and recommend that any interested parties do likewise. The rationales are given in the text. Query Mix We have removed Q4 from the mix because it is quadratic to the scale factor. The other queries are roughly n * log (n). Parameter Substitution All queries that take a product type as parameter are run in flights of several query invocations where the product type goes from broader to more specific. The initial product type specifies either the root product type or an immediate subtype of this, and the last in the drill-down is a leaf type. The rationale for this is that the choice of product type may make several orders of magnitude difference in the run time of a query. In order to make consecutive query mixes roughly comparable in execution time, all mixes should have a predictable number of query invocations with product types of each level. Query Order In the BI mix, when running multiple concurrent clients, each query mix is submitted in a random order. Queries which do drill-downs always have the steps of the drill-down as consecutive in the session, but the query templates are permuted. This is done so as to make less likely that there were two concurrent queries accessing exactly the same data. In this way, scans cannot be trivially shared between queries -- but there are still opportunities for reuse of results and adapting execution to working set, e.g., starting with what is in memory. Metrics We use a TPC-H-like metric. This metric consists of a single-user part and a multi-user part, called respectively Power and Throughput. The Power metric is a geometric mean of query run-time. 
The Throughput is the total run-time divided by the number of queries completed. After taking the mean, the time is converted into queries-per-hour. This rate is then multiplied by the scale factor divided by the scale factor for 100 Mt. In other words, we consider the 100 Mt data set as the unit scale. The Power is defined as ( scale_factor / 284826 ) * 3600 / ( ( t1 * t2 * ... * tn ) ^ ( 1 / n ) ) The Throughput is defined as ( scale_factor / 284826 ) * 3600 / ( ( t1 + t2 + ... + tn ) / n ) The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). We consider this scale &quot;one&quot;. The reason for the multiplication is that scores at different scales should get similar numbers; otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries. The Composite metric is the geometric mean of the Power and Throughput metrics. A complete report shows both Power and Throughput metrics, as well as individual query times for all queries. The rationale for using a geometric mean is to give an equal importance to long and short queries. Halving the execution time of either a long query or a short query will have the same effect on the metric. This is good for encouraging research into all aspects of query processing. On the other hand, real-life users are more interested in halving the time of queries that take one hour than of queries that take one second; therefore, the throughput metric considers run times. Taking the geometric mean of the two metrics gives more weight to the lower of the two than an arithmetic mean, hence we pay more attention to the worse of the two. Single-user and multi-user metrics are separate because of the relative importance of intra-query parallelization in BI workloads: There may not be large numbers of concurrent users, yet queries are still complex, and it is important to have maximum parallelization. Therefore the metric rewards single-user performance. 
In the next post we will look at the use of this metric and the actual content of BSBM BI. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): The Substance of Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications (this post) Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>In this post we introduce changes to the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x23cb6710">BSBM</a> BI queries and metric. These changes are motivated by prevailing benchmark practice and by our experiences in optimizing for the BSBM BI workload.</p>

<p>We will publish results according to the definitions given here and recommend that any interested parties do likewise.  The rationales are given in the text.</p>


<h3>Query Mix</h3>

<p>We have removed Q4 from the mix because it is quadratic to the scale factor.  The other queries are roughly <code>n * log (n)</code>.  </p>


<h3>Parameter Substitution </h3>

<p>All queries that take a product type as parameter are run in flights of several query invocations where the product type goes from broader to more specific.  The initial product type specifies either the root product type or an immediate subtype of this, and the last in the drill-down is a leaf type.</p>

<p>The rationale for this is that the choice of product type may make several orders of magnitude difference in the run time of a query.  In order to make consecutive query mixes roughly comparable in execution time, all mixes should have a predictable number of query invocations with product types of each level.</p>


<h3>Query Order </h3>

<p>In the BI mix, when running multiple concurrent clients, each query mix is submitted in a random order.  Queries which do drill-downs always have the steps of the drill-down as consecutive in the session, but the query templates are permuted.  This is done to make it less likely that two concurrent queries access exactly the same <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x244a3d88">data</a>.  In this way, scans cannot be trivially shared between queries -- but there are still opportunities for reuse of results and adapting execution to working set, e.g., starting with what is in memory.</p>
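<p>The ordering rule can be sketched as follows in Python. This is an illustration of the rule as stated, not the test driver&#39;s code; the template names <code>T1</code> to <code>T3</code>, the function <code>client_session</code>, and the seed are made up for the example.</p>

```python
import random

def client_session(flights, rng):
    """Permute the query templates, keeping each drill-down's steps consecutive."""
    order = list(flights)
    rng.shuffle(order)                 # only whole templates are permuted
    return [step for flight in order for step in flight]

# Hypothetical mix: T1 is a single query, T2 and T3 are drill-down flights.
flights = [
    ["T1"],
    ["T2 root", "T2 mid", "T2 leaf"],
    ["T3 root", "T3 leaf"],
]
session = client_session(flights, random.Random(808080))
print(session)
```

<p>Giving each client its own random generator yields a different template order per client while the steps of any one drill-down stay adjacent.</p>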


<h3>Metrics </h3>

<p>We use a <a class="auto-href" href="http://www.tpc.org/" id="link-id0x23880db8">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x29201c58">H</a>-like metric.  This metric consists of a single-user part and a multi-user part, called respectively <i>Power</i> and <i>Throughput.</i>  The <i>Power</i> metric is based on the geometric mean of the query run-times.  The <i>Throughput</i> is based on their arithmetic mean, i.e., the total run-time divided by the number of queries completed.  After taking the mean, the time is converted into queries-per-hour.  This rate is then multiplied by the scale factor divided by the scale factor for 100 Mt. In other words, we consider the 100 Mt data set as the unit scale.</p>

<p>The <i>Power</i> is defined as</p>
<blockquote>( scale_factor / 284826 ) *  3600 / ( ( t1 * t2 * ... * tn ) ^ ( 1 / n ) ) </blockquote>
<p>The <i>Throughput</i> is defined as</p>
<blockquote>( scale_factor / 284826 ) *  3600 / ( ( t1 + t2 + ... + tn ) / n ) </blockquote>
<p>The magic number <b><code>284826</code></b> is the scale that generates approximately 100 million triples (100 Mt).  We consider this scale &quot;one&quot;.  The reason for the multiplication is that scores at different scales should get similar numbers; otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.</p>
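<p>The two definitions can be written out as a short Python sketch. The function names, the constant <code>UNIT_SCALE</code>, and the sample query times are ours; the times are in seconds, and the geometric mean is computed through logarithms to avoid overflow with many queries.</p>

```python
import math

UNIT_SCALE = 284826  # scale factor that generates about 100 Mt

def power(scale_factor, times):
    """Scaled queries per hour from the geometric mean of query times in seconds."""
    geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
    return (scale_factor / UNIT_SCALE) * 3600 / geo_mean

def throughput(scale_factor, times):
    """Scaled queries per hour from the arithmetic mean of query times in seconds."""
    mean = sum(times) / len(times)
    return (scale_factor / UNIT_SCALE) * 3600 / mean

times = [1.0, 4.0]   # hypothetical query times
p = power(UNIT_SCALE, times)
t = throughput(UNIT_SCALE, times)
print(round(p, 3), round(t, 3))
```

<p>With these two made-up times, the geometric mean is 2 seconds and the arithmetic mean is 2.5 seconds, so the Power comes out higher than the Throughput. At ten times the unit scale, the same times would score ten times higher.</p>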


<p>The <i>Composite</i> metric is the geometric mean of the <i>Power</i> and <i>Throughput</i> metrics.  A complete report shows both <i>Power</i> and <i>Throughput</i> metrics, as well as individual query times for all queries.  The rationale for using a geometric mean in the <i>Power</i> metric is to give an equal importance to long and short queries.  Halving the execution time of either a long query or a short query will have the same effect on the metric.  This is good for encouraging research into all aspects of query processing.  On the other hand, real-life users are more interested in halving the time of queries that take one hour than of queries that take one second; therefore, the <i>Throughput</i> metric uses the arithmetic mean of the run times, in which long queries weigh more.</p>
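<p>A small numeric illustration of this difference between the two means follows, as a Python sketch with made-up times: one query of one second and one query of one hour.</p>

```python
import math

def geometric_mean(times):
    return math.exp(sum(math.log(t) for t in times) / len(times))

def arithmetic_mean(times):
    return sum(times) / len(times)

base = [1.0, 3600.0]           # hypothetical: a one-second and a one-hour query
short_halved = [0.5, 3600.0]   # the short query runs twice as fast
long_halved = [1.0, 1800.0]    # the long query runs twice as fast

geo = [geometric_mean(x) for x in (base, short_halved, long_halved)]
ari = [arithmetic_mean(x) for x in (base, short_halved, long_halved)]
print([round(g, 2) for g in geo])
print(ari)
```

<p>The geometric mean drops by the same factor in both cases, while the arithmetic mean barely moves when the short query is halved and is cut nearly in half when the long one is.</p>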

<p>Taking the geometric mean of the two metrics gives more weight to the lower of the two than an arithmetic mean would; hence we pay more attention to the worse of the two.</p>

<p>Single-user and multi-user metrics are separate because of the relative importance of intra-query parallelization in BI workloads: There may not be large numbers of concurrent users, yet queries are still complex, and it is important to have maximum parallelization. Therefore the metric rewards single-user performance.</p>


<p>In the next post we will look at the use of this metric and the actual content of BSBM BI.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1b02d528">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d65f740">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1a797860">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d3538e0">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1e566f60">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1dedffd8">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1eb11528">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1db46c38">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1c8174e8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dfa9338">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1e6dd7b0">Benchmarks, Redux (part 11): The Substance of Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d154bb0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
Benchmarks, Redux (part 13): BSBM BI Modifications <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f242ae0">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ebf2f98">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-10#1678">
  <rss:title>Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-10T23:30:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Let us talk about what ought to be benchmarked in the context of RDF. A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else&#39;s problem; besides, it is a solved one. On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item. BSBM seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is in fact a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB to RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is however a bit more difficult if also a better thought out benchmark than the BSBM BI Mix proposal. But I do not expect an RDF audience to have any enthusiasm for this as this is indeed a very tough race by now, and besides one in which RDB and SQL will keep some advantage. However, using this as a validation test is meaningful, as there exists a validation dataset and queries that we already have RDF-ized. We could publish these and call this &quot;RDF-H&quot;. In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI. The Social Network Intelligence Benchmark (SNIB) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM. In LOD2, CWI is presently working on this. The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. 
DBpedia is not very large, around 200M-300M triples, but it is diverse enough. The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere. The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from DBpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB. There is variation in frequency of online interaction, and this interaction consists of sessions. For example, one could analyze user behavior per time of day for online ad placement. The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation. There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries. Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account. Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted. The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale. 
For example, some queries are logarithmic to data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads. The first would be an online mix with typically short lookups and updates, around O ( log ( n ) ). The Business Intelligence Mix would be composed of queries around O ( n log ( n ) ). Even so, with real data, choice of parameters will provide dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of &quot;hard&quot; and &quot;easy&quot; parameter choices. In the BSBM BI mix modification, I did this by defining some to be drill downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values. Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data. One characteristic of these is the production of sometimes large intermediate results that need to be materialized. Doing these operations in practice requires procedural control. Further, running algorithms like network analytics (e.g., PageRank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language. Some basic graph operations like shortest path are expressible, but not in unextended SPARQL 1.1, as these would, for example, involve returning paths, which are explicitly excluded from the spec. 
These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload. We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads. We see a three-level structure emerging. There is an Online mix which is a bit like the BSBM Explore mix, and an Analytics mix which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice. We would finally get to the &quot;raging conflict&quot; between the &quot;declarativists&quot; and the &quot;map reductionists.&quot; Last year&#39;s VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data. We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs. We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth. There is not, nor ought there to be, a sheltered, RDF-only enclave. 
RDF will have to justify itself in a world of alternatives. This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does. RDF cannot be a total failure at this, even if this were not RDF&#39;s claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks (this post) Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Let us talk about what ought to be benchmarked in the context of <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x29979188">RDF</a>.</p>

<p>A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at <a class="auto-href" href="http://www.tpc.org/" id="link-id0x2a7082d0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x29997988">H</a> and similar workloads, and therefore there is no need for RDF to go there.  It is, as it were, somebody else&#39;s problem; besides, it is a solved one.</p>

<p>On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.</p>

<p>
<a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2b164128">BSBM</a> seems to have been adopted as a de facto RDF benchmark, as there is indeed almost nothing else.  But we should not lose sight of the fact that this is a relational <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x2a84d3c0">schema</a> and workload that has just been straightforwardly transformed to RDF.  BSBM was made, after all, in part for measuring RDB to RDF mapping.  Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be.  TPC-H is, however, a somewhat more difficult, if also better thought out, benchmark than the BSBM BI Mix proposal.  But I do not expect an RDF audience to have any enthusiasm for this, as this is indeed a very tough race by now, and besides one in which RDB and <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x2aba65b0">SQL</a> will keep some advantage.  However, using this as a validation test is meaningful, as there exist a validation dataset and queries that we have already RDF-ized.  We could publish these and call this &quot;RDF-H&quot;.  </p>

<p>In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark.  The points are in part based on discussions with <a class="auto-href" href="http://nl.linkedin.com/in/peterboncz" id="link-id0x29c81a40">Peter Boncz</a> of <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x2a0c8190">CWI</a>.</p>

<p>The <a class="auto-href" href="http://www.w3.org/wiki/Social_Network_Intelligence_BenchMark" id="link-id0x29ca15c0">Social Network Intelligence Benchmark</a> (<a class="auto-href" href="http://www.w3.org/wiki/Social_Network_Intelligence_BenchMark" id="link-id0x2990a6b8">SNIB</a>) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM.  In <a class="auto-href" href="http://lod2.eu/" id="link-id0x2a2e5338">LOD2</a>, CWI is presently working on this.</p>

<p>The <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x2a650cc0">data</a> includes <a class="auto-href" href="http://dbpedia.org/resource/DBpedia" id="link-id0x2a2e5808">DBpedia</a> as a base component used for providing conversation topics, <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x29a19570">information</a> about geographical locales of simulated users, etc.  DBpedia is not very large, around 200M-300M triples, but it is diverse enough.</p>

<p>The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.  </p>

<p>The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth.  The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from DBpedia articles.  Since there is such correlation, <a class="auto-href" href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x29995600">NLP</a> techniques like <a class="auto-href" href="http://dbpedia.org/resource/Entity" id="link-id0x29910c58">entity</a> and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.</p>

<p>There is variation in the frequency of online interaction, and this interaction consists of sessions.  For example, one could analyze user behavior per time of day for online ad placement.</p>

<p>The data probably should include propagating memes, fashions, and trends that travel on the social network.  With this, one could query about their origin and speed of propagation.</p>

<p>There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.</p>

<p>Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account.</p>

<ul>
<li>
  <p>Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.</p>
</li>

<li>
  <p>The dataset should be predictably scalable.  For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.</p>
</li>
</ul>
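
<p>A data generator could approximate the power-law requirement along the following lines.  This is only a sketch: the function name <code>sample_friend_counts</code>, the shape parameter, and the cap on friend counts are illustrative assumptions, not part of any SNIB specification.</p>

```python
import random


def sample_friend_counts(n_users, shape=1.5, seed=42):
    """Draw one friend count per simulated user from a heavy-tailed distribution.

    shape and seed are illustrative defaults, not values measured on real networks.
    """
    rng = random.Random(seed)
    cap = n_users - 1  # nobody can know more people than exist in the dataset
    counts = []
    for _ in range(n_users):
        # paretovariate returns a float of at least 1.0; large values are rare
        raw = rng.paretovariate(shape)
        counts.append(min(cap, int(raw)))
    return counts


counts = sample_friend_counts(1000)
print("median:", sorted(counts)[len(counts) // 2], "max:", max(counts))
```

<p>A fixed seed makes the draw repeatable, so the same scale factor always yields the same dataset.  In a real generator the shape would presumably be fitted to popularity measures sampled from existing social networks.</p>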

<p>For example, some queries are logarithmic in data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated).  Making a single metric from such parts may not be meaningful.  Therefore, SNIB might be structured into different workloads.</p>

<p>The first would be an online mix with typically short lookups and updates, around <code>O ( log ( n ) )</code>.  </p>

<p>The Business Intelligence Mix would be composed of queries around <code>O ( n log ( n ) )</code>.  Even so, with real data, the choice of parameters will produce dramatic changes in query run-time.  Therefore a run should be specified to have a predictable distribution of &quot;hard&quot; and &quot;easy&quot; parameter choices.  In the BSBM BI mix modification, I did this by defining some to be drill downs from a more general to a more specific level of a hierarchy.  This could be done here too in some cases; other cases would have to be defined with buckets of values. </p>
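
<p>One way to picture the &quot;buckets of values&quot; idea is the sketch below.  It assumes that a cost estimate per parameter value is available (here, a made-up count of posts per tag); the helper names <code>build_buckets</code> and <code>draw_parameters</code> are hypothetical and do not come from the BSBM or SNIB drivers.</p>

```python
import random


def build_buckets(cardinality_by_value, n_buckets=3):
    """Split parameter values into up to n_buckets groups, cheapest values first."""
    ordered = sorted(cardinality_by_value, key=cardinality_by_value.get)
    size = -(-len(ordered) // n_buckets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def draw_parameters(buckets, per_bucket, seed=0):
    """Pick the same number of values from every bucket.

    Each run then contains the same proportion of easy and hard parameters.
    """
    rng = random.Random(seed)
    chosen = []
    for bucket in buckets:
        chosen.extend(rng.choice(bucket) for _ in range(per_bucket))
    rng.shuffle(chosen)  # avoid running all easy queries first
    return chosen


# Made-up statistics: number of posts per tag
posts_per_tag = {
    "chess": 12, "curling": 40, "rugby": 900,
    "tennis": 2500, "football": 80000, "music": 120000,
}
buckets = build_buckets(posts_per_tag, n_buckets=3)
print(draw_parameters(buckets, per_bucket=2))
```

<p>Because every run draws equally from each bucket, two runs differ in the concrete values but not in the mix of selectivities, which is the property the run rules would need to guarantee.</p>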

<p>Both the real world and LOD2 are largely concerned with data integration.  The SNIB workload can have aspects of this, for example, in resolving duplicate identities.  These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.</p>
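
<p>To make the point about non-matching join attributes concrete, here is a minimal sketch of duplicate-identity detection.  The account records with <code>name</code> and <code>town</code> fields are hypothetical, and <code>difflib.SequenceMatcher</code> from the Python standard library merely stands in for whatever matching technique an actual implementation would use.</p>

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Similarity between 0 and 1 for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(accounts, threshold=0.85):
    """Return pairs of account ids that share a town and have near-identical names.

    Every pair is compared, so the work grows quadratically with the number of accounts.
    """
    pairs = []
    for id_a, id_b in combinations(sorted(accounts), 2):
        a, b = accounts[id_a], accounts[id_b]
        if a["town"] != b["town"]:
            continue
        # An equality join would miss these; the names differ by a letter or two
        if name_similarity(a["name"], b["name"]) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


accounts = {
    "u1": {"name": "Jon Smith", "town": "Utrecht"},
    "u2": {"name": "John Smith", "town": "Utrecht"},
    "u3": {"name": "Mary Jones", "town": "Utrecht"},
}
print(candidate_duplicates(accounts))
```

<p>The all-pairs loop is deliberate: it shows why such tasks fall into the quadratic-or-worse class unless some blocking step (here, the town) cuts down the candidates.</p>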

<p>One characteristic of these is the production of sometimes large intermediate results that need to be materialized.  Doing these operations in practice requires procedural control.  Further, running algorithms like network analytics (e.g., PageRank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language.  Some basic graph operations like shortest path are expressible in a query language, but not in unextended <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x2a628e60">SPARQL</a> 1.1, as these would, for example, involve returning paths, which are explicitly excluded from the spec.</p>
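
<p>As an illustration of the procedural control and repeated aggregation mentioned above, here is a textbook power-iteration sketch of PageRank over an in-memory adjacency list.  The damping factor and iteration count are conventional illustrative defaults; nothing here is prescribed by SNIB.</p>

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it points to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Intermediate result: one aggregated contribution per node, rebuilt every round
        incoming = {node: 0.0 for node in nodes}
        dangling = 0.0  # rank held by nodes without outgoing links
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = rank[node] / len(targets)
                for target in targets:
                    incoming[target] += share
            else:
                dangling += rank[node]
        rank = {
            node: (1.0 - damping) / n + damping * (incoming[node] + dangling / n)
            for node in nodes
        }
    return rank


links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

<p>The loop, the per-round table of contributions, and the fixed number of rounds are exactly the parts that a single declarative query does not express well.</p>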

<p>These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.</p>

<p>We find that such a workload will have procedural sections either in application code or stored procedures.  Map-reduce is sometimes used for scaling these.  As one would expect, many cluster databases have their own version of these control structures.  Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations.  We might here touch base with the <a class="auto-href" href="http://www.larkc.eu/" id="link-id0x29ab4860">LarKC</a> map-reduce work to see if it could be applied to SNIB workloads. </p>
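
<p>For readers less familiar with the map-reduce style, the following single-process sketch counts messages per tag in three phases.  The message layout and the function names are assumptions made for illustration; a real job would distribute the same phases across a cluster.</p>

```python
from collections import defaultdict


def map_phase(messages):
    """Emit one (tag, 1) pair for every tag on every message."""
    for message in messages:
        for tag in message["tags"]:
            yield tag, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group independently; this is the part that parallelizes."""
    return {key: sum(values) for key, values in groups.items()}


messages = [{"tags": ["Chess", "Rugby"]}, {"tags": ["Rugby"]}, {"tags": []}]
tag_counts = reduce_phase(shuffle(map_phase(messages)))
print(tag_counts)
```

<p>The same aggregation is a one-line <code>GROUP BY</code> in a query language, which is why the comparison only becomes interesting for the more complex, multi-step tasks.</p>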

<p>We see a three-level structure emerging.  There is an <i>Online</i> mix which is a bit like the BSBM <i>Explore</i> mix, and an <i>Analytics</i> mix which is on the same order of complexity as TPC-H.  These may have a more-or-less fixed query formulation and test driver.  Beyond these, yet working on the same data, we have a set of <i>Predefined Tasks</i> which the test sponsor may implement in a manner of their choice.</p>

<p>We would finally get to the &quot;raging conflict&quot; between the &quot;declarativists&quot; and  the &quot;map reductionists.&quot;  Last year&#39;s VLDB had a lot of map-reduce papers.  I know of comparisons between <a class="auto-href" href="http://www.vertica.com/" id="link-id0x29bd5828">Vertica</a> and map reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.</p>

<p>We might even interest some of the cluster <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x29d49c18">RDBMS</a> players (<a class="auto-href" href="http://www.teradata.com/" id="link-id0x29d49c40">Teradata</a>, Vertica, <a class="auto-href" href="http://dbpedia.org/resource/Greenplum" id="link-id0x2bba2248">Greenplum</a>, <a class="auto-href" href="http://dbpedia.org/page/Oracle_Exadata" id="link-id0x2bba2270">Oracle Exadata</a>, <a class="auto-href" href="http://www.paraccel.com/" id="link-id0x2ac756d0">ParAccel</a>, and/or <a class="auto-href" href="http://www.asterdata.com/" id="link-id0x2ac756f8">Aster Data</a>, to name a few) in running this workload using their map-reduce analogs.</p>


<p>We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.</p>

<p>There is not, nor ought there to be, a sheltered, RDF-only enclave.  RDF will have to justify itself in a world of alternatives.</p>

<p>This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does.  RDF cannot be a total failure at this, even if this were not RDF&#39;s claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1c9f7ab8">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1dd17b28">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1eb20620">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1f8a5ae8">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ac14a08">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1d1f8d58">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1ea83308">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b548028">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1c3d9c58">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1f5e6978">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c082a28">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ec73578">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1eb25d48">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1b261958">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-10#1677">
  <rss:title>Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-10T23:29:41Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I have in the previous posts generally argued for and demonstrated the usefulness of benchmarks. Here I will talk about how this could be organized in a way that is tractable, and takes vendor and end user interests into account. These are my views on the subject and do not represent a LOD2 members consensus, but have been discussed in the consortium. My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water! But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking. Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating le chef d&#39;oeuvre culinaire (&quot;the culinary masterpiece&quot;) create it. Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values. Indeed, an intimate knowledge de la vie secrete du canard (&quot;the secret life of duck&quot;) is required in order to liberate the aroma that it might take flight and soar. In the previous, I have shed some light on how we prepare le canard, and if le canard be such then la dinde (turkey) might in some ways be analogous; who is to say? In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice. In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained. 
This is the TPC (Transaction Processing Performance Council) model. Another culture of doing benchmarks is the periodic challenge model used in TREC, the Billion Triples Challenge, the Semantic Search Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication. A third party performing benchmarks by itself is uncommon in databases. Licenses even often explicitly prohibit this, for understandable reasons. The LOD2 project has an outreach activity called Publink where we offer to help owners of data to publish it as Linked Data. Similarly, since FP 7s are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing RDF store benchmarks. One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results. The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison. Isn&#39;t this the very truth? Let the chefs mix their own spices. This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import. In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question. Increasing the scale remains a stated objective. LOD2 even promised to run things with a trillion triples in another 3 years. Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off? 
Or would this on the contrary combine strict Justice with edifying Charity? Who indeed is in greater need of the light of objective evaluation than the vendor whose very nature makes a being of bias and prejudice? Even better, CWI, with its stellar database pedigree, agreed in principle to audit RDF benchmarks in LOD2. In this way one could get a stamp of approval for one&#39;s results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs. On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here. I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes. We could even do this unilaterally -- just publish Virtuoso results according to a predefined reporting and verification format. If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings. This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble. It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason. Then there is the matter of the BSBM Business Intelligence (BI) mix. At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer. This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions. Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around. The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well. There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it. 
If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner. (I will talk about the BI mix in more detail in part 13 and part 14 of this series.) Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit. Of course, this could be done even before then, but the content of the mix might not be settled. We likely need to check it on a few implementations first. For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which say the Virtuoso results were obtained. For example, FU Berlin could give people a login to get their recently published results fixed. Now this might or might not happen, so I will not hold my breath waiting for this but instead close with a proposal. As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment. They can set up and tune their systems, and perform the runs. We will just watch. As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data. Like this, both parties get to see the others&#39; technology with proper tuning and installation. What, if anything, is reported about this activity is up to the owner of the technology being tested. We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these. This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user. 
If you wish to take advantage of this offer, you may contact Hugh Williams at OpenLink Software, and we will see how this can be arranged in practice. The next post will talk about the actual content of benchmarks. The milestone after this will be when we publish the measurement and reporting protocols. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process (this post) Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I have in the previous posts generally argued for and demonstrated the usefulness of benchmarks.</p>

<p>Here I will talk about how this could be organized in a way that is tractable, and takes vendor and end user interests into account. These are my views on the subject and do not represent a <a class="auto-href" href="http://lod2.eu/" id="link-id0x1d4999d0">LOD2</a> members consensus, but have been discussed in the consortium. </p>

<p>My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water!  But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking.  Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating <i>le chef d&#39;oeuvre culinaire</i> (&quot;the culinary masterpiece&quot;) create it.  Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values.  Indeed, an intimate <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x1a63a168">knowledge</a> <i>de la vie secrete du canard</i> (&quot;the secret life of duck&quot;) is required in order to liberate the aroma that it might take flight and soar.  In the previous posts, I have shed some light on how we prepare <i>le canard</i>, and if <i>le canard</i> be such then <i>la dinde</i> (turkey) might in some ways be analogous; who is to say?</p>

<p>In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice.  In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained.  This is the <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1d676280">TPC</a> (Transaction Processing Performance Council) model.</p>

<p>Another culture of doing benchmarks is the periodic challenge model used in TREC, the <a class="auto-href" href="http://challenge.semanticweb.org/" id="link-id0x1c222020">Billion Triples Challenge</a>, the Semantic Search Challenge, and others. In this model, vendors prepare the benchmark submission and agree to joint publication.</p>

<p>A third party performing benchmarks by itself is uncommon in databases.  Licenses even often explicitly prohibit this, for understandable reasons.</p>

<p>The LOD2 project has an outreach activity called Publink where we offer to help owners of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x112c3dc0">data</a> to publish it as <a class="auto-href" href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1d1b7078">Linked Data</a>. Similarly, since FP7 projects are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x11c0ff08">RDF</a> store benchmarks.</p>

<p>One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results.  The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison.</p>

<p>Isn&#39;t this the very truth?   Let the chefs  mix their own spices.</p>

<p>This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import.</p>

<p>In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question.  Increasing the scale remains a stated objective.  LOD2 even promised to run things with a trillion triples in another 3 years.  </p>

<p>Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off?  Or would this on the contrary combine strict Justice with edifying Charity?  Who indeed is in greater need of the light of objective evaluation than the vendor, whose very nature makes it a being of bias and prejudice?</p>

<p>Even better, <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x1c369958">CWI</a>, with its <a href="http://monetdb.cwi.nl/Development/Research/Articles/" id="link-id0x1d6479d0">stellar database pedigree</a>, agreed in principle to audit RDF benchmarks in LOD2. </p>

<p>In this way one could get a stamp of approval for one&#39;s results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs.  On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here.  I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes.</p>

<p>We could even do this unilaterally -- just publish <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1e0c0690">Virtuoso</a> results according to a predefined reporting and verification format.  If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings.  This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble.  It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason.</p>

<p>Then there is the matter of the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1ea75360">BSBM</a> Business Intelligence (BI) mix.  At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer.  This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions.  Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around.  The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well.  There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it.  If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner.  </p>

<p>(I will talk about the BI mix in more detail in <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1dfcc038">part 13</a> and <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1edaa388">part 14</a> of this series.)</p>

<p>Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit. </p>

<p>Of course, this could be done even before then, but the content of the mix might not be settled.  We likely need to check it on a few implementations first.</p>

<p>For equipment, people can use their own, or LOD2 partners might, on a case-by-case basis, make some equipment available so that others can run on the same hardware on which, say, the Virtuoso results were obtained.  For example, FU Berlin could give people a login to get their recently published results fixed.  Now this might or might not happen, so I will not hold my breath waiting for it but will instead close with a proposal.</p>

<p>As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment.  They can set up and tune their systems, and perform the runs.  We will just watch.  As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data.  Like this, both parties get to see the other&#39;s technology with proper tuning and installation.  What, if anything, is reported about this activity is up to the owner of the technology being tested.  We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these.  This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user.  If you wish to take advantage of this offer, you may contact <a href="mailto:hwilliams@openlinksw.com?subject=Collaborative RDF Benchmark" id="link-id0x1c071100">Hugh Williams</a> at OpenLink Software, and we will see how this can be arranged in practice.
</p>

<p>The next post will talk about the <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x19933fd8">actual content of benchmarks</a>.  The milestone after this will be when we publish the measurement and reporting protocols.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1c554800">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1ec159e8">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1dd5eb10">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x18f05940">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ed5ef10">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1e9cb130">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1dfa79d8">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1eb6f478">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1de5a918">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1dae9060">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f45fa10">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f49d2b8">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e68e4c8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e353858">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-09#1675">
  <rss:title>Benchmarks, Redux (part 9): BSBM With Cluster</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-09T22:54:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This post is dedicated to our brothers in horizontal partitioning (or sharding), Garlik and Bigdata. At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism. For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn&#39;t run BSBM either. We have understood from Systap that their Bigdata BSBM experiments were on a single-process configuration. But the 4Store results in the recent Berlin report were with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server, so it seemed interesting to us to compare how Virtuoso Cluster compares with Virtuoso Single with this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable. The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its HTTP and SQL listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box. 6 Cluster - Load Rates and Times Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 119,204 749 89 200 Mt 121,607 1486 157 1000 Mt 102,694 8737 979 6 Single - Load Rates and Times Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 74,713 1192 145 The load times are systematically better than for 6 Single. This is also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. 
We note that loading is a cluster friendly operation, going at a steady 1400+% CPU utilization with an aggregate message throughput of 40MB/s. 7 Single is faster because of vectoring at the index level, not because the clusters were hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box. Throughput is as follows: 6 Cluster - Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7318 43120 200 Mt 6222 29981 1000 Mt 2526 11156 6 Single - Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487 Below is a snapshot of status during the 6 Cluster 100 Mt run. Cluster 8 nodes, 15 s. 25784 m/s 25682 KB/s 1160% cpu 0% read 740% clw threads 18r 0w 10i buffers 1133459 12 d 4 w 0 pfs cl 1: 10851 m/s 3911 KB/s 597% cpu 0% read 668% clw threads 17r 0w 10i buffers 143992 4 d 0 w 0 pfs cl 2: 2194 m/s 7959 KB/s 107% cpu 0% read 9% clw threads 1r 0w 0i buffers 143616 3 d 2 w 0 pfs cl 3: 2186 m/s 7818 KB/s 107% cpu 0% read 9% clw threads 0r 0w 0i buffers 140787 0 d 0 w 0 pfs cl 4: 2174 m/s 2804 KB/s 77% cpu 0% read 10% clw threads 0r 0w 0i buffers 140654 0 d 2 w 0 pfs cl 5: 2127 m/s 1612 KB/s 71% cpu 0% read 9% clw threads 0r 0w 0i buffers 140949 1 d 0 w 0 pfs cl 6: 2060 m/s 544 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141295 2 d 0 w 0 pfs cl 7: 2072 m/s 517 KB/s 65% cpu 0% read 11% clw threads 0r 0w 0i buffers 141111 1 d 0 w 0 pfs cl 8: 2105 m/s 522 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141055 1 d 0 w 0 pfs The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes. We note that CPU utilization is highly uneven and messages are short, about 1K on the average, compared to about 100K during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. 
We changed the test driver to round-robin requests between multiple end points. The work does then get evenly divided, but the speed is not affected. Also, this does not improve the message sizes since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, the round-robin would be essential for CPU and especially for interconnect throughput. Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu. This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Also Single User throughput on 6 Cluster outperforms 6 Single, due to the natural parallelism of doing the Q5 joins in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size, i.e., the KB/s throughput is almost double while the messages/s is a bit under a third. The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized. Looking further at the 6 Cluster status we see the cluster wait (clw) to be 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution between partitions; a low figure means even. This is as expected, since many queries are concerned with just one S and its related objects. We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster (this post) Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This post is dedicated to our brothers in horizontal partitioning (or sharding), <a class="auto-href" href="http://freebase.com/guid/9202a8c04000641f8000000005c908d6" id="link-id0x113d92b0">Garlik</a> and <a class="auto-href" href="http://www.systap.com/bigdata.htm" id="link-id0x1cca0090">Bigdata</a>.</p>

<p>At first sight, the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1c1c1330">BSBM</a> <i>Explore</i> mix appears very cluster-unfriendly, as it contains short queries that access <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0xa2e1940">data</a> at random. There is every opportunity for latency and few opportunities for parallelism.</p>

<p>For this reason we had not even run the BSBM mix with <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1e734de0">Virtuoso</a> Cluster. We were not surprised to learn that <a href="http://steveharris.tumblr.com/post/3453040647/bsbm-v3-post-mortem" id="link-id0x1c4ef8d8">Garlik hadn&#39;t run BSBM either</a>. We have understood from <a class="auto-href" href="http://www.systap.com/" id="link-id0x1c579da0">Systap</a> that their Bigdata BSBM experiments were on a single-process configuration.</p>

<p>But the 4Store results in the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1f8090f8">recent Berlin report</a> were with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server. It therefore seemed interesting to us to see how Virtuoso Cluster compares with Virtuoso Single on this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable.</p>

<p>The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its <a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1bbcd560">HTTP</a> and <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x1ea554c0">SQL</a> listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">6 Cluster - Load Rates and Times</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 119,204 </td>
		<td align="center"> 749 </td>
		<td align="center"> 89 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 121,607 </td>
		<td align="center"> 1486 </td>
		<td align="center"> 157 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 102,694 </td>
		<td align="center"> 8737 </td>
		<td align="center"> 979 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">6 Single - Load Rates and Times</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 74,713 </td>
		<td align="center"> 1192 </td>
		<td align="center"> 145 </td>
	</tr>
</table>



<p>The load times are systematically better than for 6 Single. This is also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. We note that loading is a cluster-friendly operation, going at a steady 1400+% <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c55c8f0">CPU</a> utilization with an aggregate message throughput of 40 MB/s. 7 Single is faster because of vectoring at the index level, not because the clusters were hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.</p>
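<p>As a rough cross-check of the load tables, the sketch below recomputes the rates. It assumes the nominal triple counts (100, 200, and 1000 million) and assumes that the published rate counts the checkpoint as part of the load; neither is stated in the tables, so this is only our reading of the numbers. The names <code>RUNS</code> and <code>load_rate</code> are made up for the illustration.</p>

```python
# Nominal scale (triples), load time and checkpoint time in seconds,
# copied from the two load tables above.
RUNS = {
    "6 Cluster 100 Mt": (100_000_000, 749, 89),
    "6 Cluster 200 Mt": (200_000_000, 1486, 157),
    "6 Cluster 1000 Mt": (1_000_000_000, 8737, 979),
    "6 Single 100 Mt": (100_000_000, 1192, 145),
}


def load_rate(triples, load_s, checkpoint_s):
    """Triples per second, counting the checkpoint as part of the load (an assumption)."""
    return triples / (load_s + checkpoint_s)


for name, (triples, load_s, checkpoint_s) in RUNS.items():
    print(f"{name}: {load_rate(triples, load_s, checkpoint_s):,.0f} per second")
```

<p>Under these assumptions the recomputed rates land within about 0.3% of the published ones, which suggests the actual data sets are slightly smaller than the nominal scale.</p>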

<p>Throughput is as follows:</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 6 Cluster - Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7318 </td>
		<td align="center"> 43120 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 6222 </td>
		<td align="center"> 29981 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 2526 </td>
		<td align="center"> 11156 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 6 Single - Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7641 </td>
		<td align="center"> 29433 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 6017 </td>
		<td align="center"> 13335 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 1770 </td>
		<td align="center"> 2487 </td>
	</tr>
</table>
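<p>To make the two throughput tables easier to compare, the following sketch computes the ratio of 6 Cluster to 6 Single for each scale and user count. The figures are copied from the tables; the names <code>CLUSTER</code>, <code>SINGLE</code>, and <code>ratios</code> are only for illustration.</p>

```python
# QMpH figures copied from the two throughput tables: scale -> (Single User, 16 User).
CLUSTER = {"100 Mt": (7318, 43120), "200 Mt": (6222, 29981), "1000 Mt": (2526, 11156)}
SINGLE = {"100 Mt": (7641, 29433), "200 Mt": (6017, 13335), "1000 Mt": (1770, 2487)}


def ratios(scale):
    """Return (Single User ratio, 16 User ratio) of 6 Cluster over 6 Single."""
    cluster_1, cluster_16 = CLUSTER[scale]
    single_1, single_16 = SINGLE[scale]
    return cluster_1 / single_1, cluster_16 / single_16


for scale in CLUSTER:
    ratio_1, ratio_16 = ratios(scale)
    print(f"{scale}: Single User x{ratio_1:.2f}, 16 User x{ratio_16:.2f}")
```

<p>A ratio above 1 means 6 Cluster is ahead. The advantage grows with both scale and concurrency, which is the pattern discussed below.</p>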


<p>Below is a snapshot of status during the 6 Cluster 100 Mt run.</p>

<blockquote>
 <code><pre>
Cluster 8 nodes, 15 s.
       25784 m/s  25682 KB/s  1160% cpu  0% read  740% clw  threads 18r 0w 10i  buffers 1133459  12 d  4 w  0 pfs
cl 1:  10851 m/s   3911 KB/s   597% cpu  0% read  668% clw  threads 17r 0w 10i  buffers  143992   4 d  0 w  0 pfs
cl 2:   2194 m/s   7959 KB/s   107% cpu  0% read    9% clw  threads  1r 0w  0i  buffers  143616   3 d  2 w  0 pfs
cl 3:   2186 m/s   7818 KB/s   107% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140787   0 d  0 w  0 pfs
cl 4:   2174 m/s   2804 KB/s    77% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  140654   0 d  2 w  0 pfs
cl 5:   2127 m/s   1612 KB/s    71% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140949   1 d  0 w  0 pfs
cl 6:   2060 m/s    544 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141295   2 d  0 w  0 pfs
cl 7:   2072 m/s    517 KB/s    65% cpu  0% read   11% clw  threads  0r 0w  0i  buffers  141111   1 d  0 w  0 pfs
cl 8:   2105 m/s    522 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141055   1 d  0 w  0 pfs
</pre>
 </code>
</blockquote>


<p>The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes. </p>

<p>We note that CPU utilization is highly uneven and messages are short, about 1 KB on average, compared to about 100 KB during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. We changed the test driver to round-robin requests between multiple endpoints. The work does then get evenly divided, but the speed is not affected. Also, this does not improve the message sizes since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, the round-robin would be essential for CPU and especially for interconnect throughput. </p>
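<p>The round-robin idea can be sketched as follows. This is not the code of the BSBM test driver, only a minimal illustration of rotating requests over several endpoints; the endpoint URLs and port numbers are hypothetical, as is the helper name <code>assign_round_robin</code>.</p>

```python
from itertools import cycle

# Hypothetical endpoint URLs, one per cluster process; the ports are made up.
ENDPOINTS = [f"http://localhost:{8890 + i}/sparql" for i in range(8)]


def assign_round_robin(queries, endpoints):
    """Pair each query with the next endpoint in rotation."""
    rotation = cycle(endpoints)
    return [(next(rotation), query) for query in queries]


plan = assign_round_robin([f"query-{n}" for n in range(16)], ENDPOINTS)
for endpoint, query in plan[:3]:
    print(endpoint, query)
```

<p>With 8 endpoints and 16 queries, each process coordinates two queries, so the coordination work is spread evenly rather than concentrated on one node.</p>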


<p>Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu.</p>

<p>At this scale, the 16 User throughput of 6 Cluster (11156 QMpH) is a lot better than that of 6 Single (2487 QMpH), due to lower contention on the index tree, as discussed in <i><a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1e9a0b58">A Benchmarking Story</a></i>. Single User throughput on 6 Cluster also outperforms 6 Single, because the Q5 joins naturally run in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size: compared with the 100 Mt snapshot, the KB/s throughput is almost double while the messages/s is a bit under a third.</p>
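<p>The message-size observation can be checked with a little arithmetic. The sketch below divides message volume by message rate for the 100 Mt status snapshot and for the 1000 Mt 16 User figures quoted above. It assumes the 100 Mt snapshot was also taken during a 16 User run, which the text suggests but does not state outright; the function name is illustrative.</p>

```python
def avg_message_kb(kb_per_s, messages_per_s):
    """Average message size in KB from the two interconnect meters."""
    return kb_per_s / messages_per_s


small = avg_message_kb(25682, 25784)  # 100 Mt snapshot, cluster-wide totals
large = avg_message_kb(44366, 6573)   # 1000 Mt, 16 User

print(f"100 Mt: {small:.2f} KB per message, 1000 Mt: {large:.2f} KB per message")
print(f"volume ratio {44366 / 25682:.2f}, message-rate ratio {6573 / 25784:.2f}")
```

<p>The average message at 1000 Mt comes out several times larger than at 100 Mt, consistent with longer per-partition work on the Q5 joins at the larger scale.</p>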


<p>The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized.</p>

<p>Looking further at the 6 Cluster status, we see the cluster wait (<code>clw</code>) to be 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution between partitions; a low figure means even. This is as expected, since many queries are concerned with just one subject (S) and its related objects.</p>
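<p>The "about half" estimate follows from simple arithmetic, sketched below. The sketch assumes that each of the 16 concurrent clients accounts for 100% of real time, so that the total is 1600%; this is our interpretation of the <code>clw</code> meter, not a documented definition. The function names are illustrative.</p>

```python
def wait_fraction(clw_percent, users):
    """Share of total client real time spent in cluster wait.

    Assumes each of the `users` concurrent clients accounts for 100% of real time.
    """
    return clw_percent / (100.0 * users)


def node_share(node_clw_percent, total_clw_percent):
    """Share of the total cluster wait that is reported by one node."""
    return node_clw_percent / total_clw_percent


print(f"{wait_fraction(740, 16):.0%} of execution real time is cluster wait")
print(f"{node_share(668, 740):.0%} of the cluster wait is on node 1")
```

<p>Most of the wait is reported by node 1, the process receiving the HTTP requests in this run, which matches the uneven CPU distribution seen in the snapshot.</p>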


<p>We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d7894d0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1e434888">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1f6b5260">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1dd29460">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1f0d78b8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1f9a9670">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1c055370">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1dc06cd0">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
Benchmarks, Redux (part 9): BSBM With Cluster <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x18f04db0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1ee729b8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e2e76b8">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d75ef48">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ee518c0">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d9244b0">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-09#1673">
  <rss:title>Benchmarks, Redux (part 8): BSBM Explore and Update </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-09T17:32:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We will here look at the Explore and Update scenario of BSBM. This presents us with a novel problem as the specification does not address any aspect of ACID. A transaction benchmark ought to have something to say about this. The SPARUL (also known as SPARQL/Update) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability. We begin by running Virtuoso 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning SERIALIZABLE isolation between INSERTs and DELETEs, and READ COMMITTED isolation between READ and any UPDATE transaction. (Figures for Virtuoso 6 will also be presented here in the near future, as they are the currently shipping production versions.) Virtuoso 7 Single, Full ACID (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 9,969 65,537 200 Mt 8,646 40,527 1000 Mt 5,512 17,293 Virtuoso 6 Cluster, Full ACID (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 5604.520 34079.019 1000 Mt 2866.616 10028.325 Virtuoso 6 Single, Full ACID (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7,152 21,065 200 Mt 5,862 16,895 1000 Mt 1,542 4,548 Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm cache; see previous post on read-ahead for details. All runs do 1000 Explore and Update mixes. The initial database is in the state following the Explore only runs. The results are in line with the Explore results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability. In the following we will look at transactions and at how the definition of the workload and reporting could be made complete. 
Full ACID means serializable semantic of concurrent insert and delete of the same quad. Non-transactional means that on concurrent insert and delete of overlapping sets of quads the result is undefined. Further if one logged such &quot;transactions,&quot; the replay would give serialization although the initial execution did not, hence further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail. If the data already exists the insert does nothing, if the quad does not previously exist it is created. The same applies to deletes alone. If a delete and insert overlap, serialization would be needed but the semantics implicit in the use case make this improbable. Read-only transactions (i.e., the Explore mix in the Explore and Update scenario) will be run as READ COMMITTED. These do not see uncommitted data and never block for lock wait. The reads may not be repeatable. Our first point of call is to determine the cost of ACID. We run 1000 mixes of Explore and Update at 1000 Mt. The throughput is 19214 after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale. We look at the pertinent statistics: SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC; KEY_TABLE INDEX_NAME LOCKS WAITS WAIT_PCT DEADLOCKS LOCK_ESC WAIT_MSECS =============== ============= ====== ===== ======== ========= ======== ========== DB.DBA.RDF_QUAD RDF_QUAD_POGS 179205 934 0 0 0 35164 DB.DBA.RDF_IRI RDF_IRI 20752 217 1 0 0 16445 DB.DBA.RDF_QUAD RDF_QUAD_SP 9244 3 0 0 0 235 We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run was 187 seconds, real time. The lock wait time is not real time since this is the total elapsed wait time summed over all threads. 
The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions. We note that we do not get deadlocks since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row level locking also for this index. This is to be seen. The column store would otherwise tend to have higher cost per random insert. Considering these results it does not seem crucial to &quot;drop ACID,&quot; though doing so would save some time. We will now run measurements for all scales with 16 Users and ACID. 
Let us now see what the benchmark writes: SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC; KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS =========================== ============================ ========= ======= ======== ======= ========= DB.DBA.RDF_QUAD RDF_QUAD_POGS 763846891 237436 0 58040 228606 DB.DBA.RDF_QUAD RDF_QUAD 213282706 1991836 0 30226 1940280 DB.DBA.RDF_OBJ RO_VAL 15474 17837 115 13438 17431 DB.DBA.RO_START RO_START 10573 11195 105 10228 11227 DB.DBA.RDF_IRI RDF_IRI 61902 125711 203 7705 121300 DB.DBA.RDF_OBJ RDF_OBJ 23809053 3205963 13 636 3072517 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 3237687 504486 15 340 488797 DB.DBA.RDF_QUAD RDF_QUAD_SP 89995 70446 78 99 68340 DB.DBA.RDF_QUAD RDF_QUAD_OP 19440 47541 244 66 45583 DB.DBA.VTLOG_DB_DBA_RDF_OBJ VTLOG_DB_DBA_RDF_OBJ 3014 1 0 11 11 DB.DBA.RDF_QUAD RDF_QUAD_GS 1261 801 63 10 751 DB.DBA.RDF_PREFIX RDF_PREFIX 14 168 1120 1 153 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1807 200 11 1 200 The most dirty pages are on the POGS index, which is reasonable; values are spread out at random. After this we have the PSOG index, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, with the index from leading string or hash of the literal to id leading, as one would expect, again because of values being distributed at random. After this come IRIs. The distribution of updates is generally as one would expect. * * * Going back to BSBM, at least the following aspects of the benchmark have to be further specified: Disclosure of ACID properties. If the benchmark required full ACID many would not run this at all. Besides full ACID is not necessarily an absolute requirement based on the hypothetical usage scenario of the benchmark. However, when publishing numbers the guarantees that go with the numbers must be made explicit. This includes logging, checkpoint frequency or equivalent etc. Steady state. 
The working set of the Update mix is different from that of the Explore mixes. This touches more indices than Explore. The Explore warm-up is in part good but does not represent steady state. Checkpoint and sustained throughput. Benchmarks involving update generally have rules for checkpointing the state and for sustained throughput. In specific, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed with a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no less than 3 completed checkpoints in the interval with at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used then this needs to be described. Memory and warm-up issues.We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also the test driver should allow running updates in timed and non-timed mode (warm-up). With an update benchmark, many more things need to be defined, and the set-up becomes more system specific, than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update (this post) Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We will here look at the <i>Explore and Update</i> scenario of <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2a621a00">BSBM</a>. This presents us with a novel problem as the specification does not address any aspect of <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x2a2d4310">ACID</a>.</p>

<p>A transaction benchmark ought to have something to say about this. The <a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x27cf4478">SPARUL</a> (also known as <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x29bb7f80">SPARQL</a>/<a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x2978e570">Update</a>) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability.</p>

<p>We begin by running <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x27a2b9f0">Virtuoso</a> 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning <code>SERIALIZABLE</code> isolation between <code>INSERTs</code> and <code>DELETEs</code>, and <code>READ COMMITTED</code> isolation between <code>READ</code> and any <code>UPDATE</code> transaction. (Figures for Virtuoso 6 will also be presented here in the near future, as they are the currently shipping production versions.)</p>


<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> Virtuoso 7 Single, Full ACID <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 9,969 </td>
		<td align="center"> 65,537 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 8,646 </td>
		<td align="center"> 40,527 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 5,512 </td>
		<td align="center"> 17,293 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> Virtuoso 6 Cluster, Full ACID <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center"> Scale </th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center"> 100 Mt </th>
		<td align="center"> 5,604.520 </td>
		<td align="center"> 34,079.019 </td>
	</tr>
	<tr>
		<th align="center"> 1000 Mt </th>
		<td align="center"> 2,866.616 </td>
		<td align="center"> 10,028.325 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> Virtuoso 6 Single, Full ACID <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7,152 </td>
		<td align="center"> 21,065 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 5,862 </td>
		<td align="center"> 16,895 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 1,542 </td>
		<td align="center"> 4,548 </td>
	</tr>
</table>



<p>Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x298139d0">cache</a>; see <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1f8ac510">previous post on read-ahead</a> for details. All runs do 1000 <i>Explore and Update</i> mixes. The initial database is in the state following the <i>Explore</i> only runs.</p>

<p>The results are in line with the <i>Explore</i> results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability.</p>
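
<p>To put the tables above in proportion, the following sketch computes the ratio of 16 User to Single User throughput for the Virtuoso 7 Single figures. It is plain Python arithmetic over the numbers reported in the table; the names <code>v7_single</code> and <code>speedup</code> are made up for illustration and are not part of any benchmark tooling.</p>

<pre>
# QMpH figures copied from the "Virtuoso 7 Single, Full ACID" table above.
# Keys are scales in Mt; values are (single_user, sixteen_user).
v7_single = {
    100: (9969, 65537),
    200: (8646, 40527),
    1000: (5512, 17293),
}

def speedup(single_user, multi_user):
    """Ratio of multi-user to single-user throughput."""
    return multi_user / single_user

ratios = {scale: speedup(*qmph) for scale, qmph in v7_single.items()}

for scale, ratio in ratios.items():
    print(f"{scale} Mt: 16 User is {ratio:.2f}x Single User")
</pre>

<p>The ratio shrinks as the scale grows, from roughly 6.6 at 100 Mt to roughly 3.1 at 1000 Mt.</p>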

<p>In the following we will look at transactions and at how the definition of the workload and reporting could be made complete.</p>


<p>Full ACID means serializable semantics for concurrent inserts and deletes of the same quad. Non-transactional means that the result of concurrent inserts and deletes of overlapping sets of quads is undefined. Further, if one logged such &quot;transactions,&quot; the replay would give a serialization although the initial execution did not, further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail: if the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x29a2f220">data</a> already exists, the insert does nothing; if the quad does not already exist, it is created. The same applies to deletes alone. If a delete and an insert overlap, serialization would be needed, but the semantics implicit in the use case make this improbable.</p>
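
<p>The point about insert-only and delete-only workloads can be illustrated with a toy model in which the store is a plain Python set of quads. This is only a sketch of the set semantics, not of how Virtuoso executes updates; the <code>insert</code> and <code>delete</code> helpers and the sample quads are hypothetical.</p>

<pre>
def insert(store, quads):
    """Insert quads; an existing quad is left as is, so this cannot fail."""
    return store | set(quads)

def delete(store, quads):
    """Delete quads; a missing quad is ignored, so this cannot fail either."""
    return store - set(quads)

# Hypothetical quads, written as (graph, subject, predicate, object).
q1 = ("g", "s1", "p", "o1")
q2 = ("g", "s2", "p", "o2")
empty = frozenset()
full = frozenset([q1, q2])

# Two inserts give the same result in either order.
inserts_commute = (
    insert(insert(empty, [q1]), [q1, q2]) == insert(insert(empty, [q1, q2]), [q1])
)

# Two deletes also give the same result in either order.
deletes_commute = (
    delete(delete(full, [q1]), [q1, q2]) == delete(delete(full, [q1, q2]), [q1])
)

# An overlapping insert and delete of the same quad depends on the order.
insert_then_delete = delete(insert(empty, [q1]), [q1])  # q1 ends up absent
delete_then_insert = insert(delete(empty, [q1]), [q1])  # q1 ends up present
order_matters = insert_then_delete != delete_then_insert

print(inserts_commute, deletes_commute, order_matters)
</pre>

<p>Only the last case, where an insert and a delete touch the same quad, has an outcome that depends on the order of execution, which is where serialization would matter.</p>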


<p>Read-only transactions (i.e., the <i>Explore</i> mix in the <i>Explore and Update</i> scenario) will be run as <code>READ COMMITTED</code>. These do not see uncommitted data and never block for lock wait. The reads may not be repeatable.</p>

<p>Our first port of call is to determine the cost of ACID. We run 1000 mixes of <i>Explore and Update</i> at 1000 Mt. The throughput is 19,214 QMpH after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale.</p>

<p>We look at the pertinent statistics:</p>

<pre>
SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC;
</pre>

<blockquote>
 <code><pre>
KEY_TABLE         INDEX_NAME       LOCKS   WAITS   WAIT_PCT   DEADLOCKS   LOCK_ESC   WAIT_MSECS
===============   =============   ======   =====   ========   =========   ========   ==========
DB.DBA.<a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x299432f0">RDF</a>_QUAD   RDF_QUAD_POGS   179205     934          0           0          0        35164
DB.DBA.RDF_IRI    RDF_IRI          20752     217          1           0          0        16445
DB.DBA.RDF_QUAD   RDF_QUAD_SP       9244       3          0           0          0          235
</pre>
 </code>
</blockquote>

<p>We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run took 187 seconds of real time. The lock wait time is not real time, since it is the elapsed wait time summed over all threads. The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions.</p>
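
<p>The arithmetic behind these observations can be spelled out as follows. This is a sketch in Python over the figures quoted above. The thread count of 16 is an assumption (one thread per user in a 16 User run) that the text does not state for this particular run, and the variable names are made up for illustration.</p>

<pre>
# Figures copied from the sys_l_stat output and the text above.
lock_stats = {
    "RDF_QUAD_POGS": {"waits": 934, "wait_msecs": 35164},
    "RDF_IRI": {"waits": 217, "wait_msecs": 16445},
    "RDF_QUAD_SP": {"waits": 3, "wait_msecs": 235},
}
mixes = 1000        # query mixes in the run
run_seconds = 187   # real time of the run, as quoted (rounded)
threads = 16        # ASSUMPTION: one thread per user in a 16 User run

total_waits = sum(s["waits"] for s in lock_stats.values())
total_wait_ms = sum(s["wait_msecs"] for s in lock_stats.values())

waits_per_mix = total_waits / mixes
pogs = lock_stats["RDF_QUAD_POGS"]
avg_pogs_wait_ms = pogs["wait_msecs"] / pogs["waits"]

# Share of all thread time spent waiting for locks.
wait_share = (total_wait_ms / 1000) / (threads * run_seconds)

# Throughput implied by the rounded run time, in query mixes per hour.
qmph = mixes * 3600 / run_seconds

print(f"{waits_per_mix:.2f} waits per mix")
print(f"{avg_pogs_wait_ms:.1f} ms average wait on POGS")
print(f"{100 * wait_share:.1f}% of thread time spent in lock waits")
print(f"{qmph:.0f} QMpH implied by the run time")
</pre>

<p>Under these assumptions the lock waits come to a little under 2% of the total thread time, and the throughput implied by the rounded run time is within a fraction of a percent of the reported figure.</p>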

<p>We note that we do not get deadlocks, since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case, since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row-level locking also for this index. This remains to be seen. The column store would otherwise tend to have a higher cost per random insert.</p>
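
<p>The reason ascending key order rules out deadlocks is the general lock-ordering argument: if every transaction takes its locks in the same global order, no cycle of waits can form. The sketch below shows this with Python threads. It is an illustration of the principle only, not of Virtuoso&#39;s lock manager; the integer keys and the <code>transaction</code> function are hypothetical.</p>

<pre>
import threading

# One lock per key; the keys stand in for index entries. Purely illustrative.
locks = {key: threading.Lock() for key in range(10)}
finished = []

def transaction(name, keys):
    """Lock every key in ascending order, do the work, then release."""
    ordered = sorted(keys)
    for key in ordered:
        locks[key].acquire()
    try:
        finished.append(name)  # stands in for the work of the transaction
    finally:
        for key in reversed(ordered):
            locks[key].release()

# The two transactions touch overlapping keys, listed in different orders.
t1 = threading.Thread(target=transaction, args=("t1", [3, 7, 1]))
t2 = threading.Thread(target=transaction, args=("t2", [7, 1, 5]))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)

print(sorted(finished))
</pre>

<p>Both transactions complete regardless of scheduling, because each one requests the shared keys 1 and 7 in the same order. If each thread locked its keys in the order listed instead, one could hold 7 while waiting for 1 and the other hold 1 while waiting for 7.</p>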

<p>Considering these results it does not seem crucial to &quot;drop ACID,&quot; though doing so would save <i>some</i> time. We will now run measurements for all scales with 16 Users and ACID. </p>

<p>Let us now see what the benchmark writes:</p>

<pre>
SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC;
</pre>

<blockquote>
 <code><pre>
KEY_TABLE                     INDEX_NAME                       TOUCHES     READS   READ_PCT   N_DIRTY   N_BUFFERS
===========================   ============================   =========   =======   ========   =======   =========
DB.DBA.RDF_QUAD               RDF_QUAD_POGS                  763846891    237436          0     58040      228606
DB.DBA.RDF_QUAD               RDF_QUAD                       213282706   1991836          0     30226     1940280
DB.DBA.RDF_OBJ                RO_VAL                             15474     17837        115     13438       17431
DB.DBA.RO_START               RO_START                           10573     11195        105     10228       11227
DB.DBA.RDF_IRI                RDF_IRI                            61902    125711        203      7705      121300
DB.DBA.RDF_OBJ                RDF_OBJ                         23809053   3205963         13       636     3072517
DB.DBA.RDF_IRI                DB_DBA_RDF_IRI_UNQC_RI_ID        3237687    504486         15       340      488797
DB.DBA.RDF_QUAD               RDF_QUAD_SP                        89995     70446         78        99       68340
DB.DBA.RDF_QUAD               RDF_QUAD_OP                        19440     47541        244        66       45583
DB.DBA.VTLOG_DB_DBA_RDF_OBJ   VTLOG_DB_DBA_RDF_OBJ                3014         1          0        11          11
DB.DBA.RDF_QUAD               RDF_QUAD_GS                         1261       801         63        10         751
DB.DBA.RDF_PREFIX             RDF_PREFIX                            14       168       1120         1         153
DB.DBA.RDF_PREFIX             DB_DBA_RDF_PREFIX_UNQC_RP_ID        1807       200         11         1         200
</pre>
 </code>
</blockquote>


<p>The largest number of dirty pages is on the <code>POGS</code> index, which is reasonable, since values are spread out at random. Next comes the <code>PSOG</code> index, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, led by the index from the leading string or hash of the literal to its id, as one would expect, again because the values are distributed at random. After this come IRIs. The distribution of updates is generally as one would expect.</p>

<p align="center">* * *</p>

<p>Going back to BSBM, at least the following aspects of the benchmark have to be further specified:</p>

<ul>
<li>
  <p>
    <b>Disclosure of ACID properties.</b> If the benchmark required full ACID, many would not run it at all. Besides, full ACID is not necessarily an absolute requirement given the hypothetical usage scenario of the benchmark. However, when publishing numbers, the guarantees that go with the numbers must be made explicit. This includes logging, checkpoint frequency, or the equivalent.</p>
</li>

<li>
  <p>
    <b>Steady state.</b> The working set of the <i>Update</i> mix is different from that of the <i>Explore</i> mixes. The <i>Update</i> mix touches more indices than <i>Explore</i>. The <i>Explore</i> warm-up is in part useful but does not represent steady state.</p>
</li>

<li>
  <p>
    <b>Checkpoint and sustained throughput.</b> Benchmarks involving update generally have rules for checkpointing the state and for sustained throughput. Specifically, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed with a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no fewer than 3 completed checkpoints in the interval and at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used then this needs to be described.</p>
</li>

<li>
  <p>
    <b>Memory and warm-up issues.</b> We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also, the test driver should allow running updates in timed and non-timed (warm-up) mode.</p>
</li>
</ul>
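
<p>The example checkpoint rule in the list above could be checked mechanically along the following lines. This is a minimal Python sketch that uses the example thresholds from the text. The function name, the representation of checkpoints as (start, end) pairs, and the choice to count only checkpoints that both start and complete inside the interval are assumptions made for illustration, not part of any existing rule set.</p>

<pre>
from datetime import datetime, timedelta

# Example thresholds taken from the text above; a real rule set may differ.
MIN_DURATION = timedelta(minutes=30)
MIN_CHECKPOINTS = 3
MIN_GAP = timedelta(minutes=9)

def interval_is_valid(start, end, checkpoints):
    """checkpoints is a list of (checkpoint_start, checkpoint_end) pairs."""
    # ASSUMPTION: a checkpoint counts only if it starts and ends in the interval.
    inside = sorted(c for c in checkpoints if c[0] >= start and end >= c[1])
    if MIN_DURATION > end - start:
        return False
    if MIN_CHECKPOINTS > len(inside):
        return False
    # Gap from the end of one checkpoint to the start of the next.
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(inside, inside[1:])]
    return all(gap >= MIN_GAP for gap in gaps)

# Arbitrary reference time; only the offsets matter.
t0 = datetime(2000, 1, 1, 12, 0)

def at(minutes):
    return t0 + timedelta(minutes=minutes)

well_spaced = [(at(1), at(2)), (at(11), at(12)), (at(21), at(22))]
too_close = [(at(1), at(2)), (at(6), at(7)), (at(21), at(22))]

print(interval_is_valid(at(0), at(30), well_spaced))
print(interval_is_valid(at(0), at(30), too_close))
</pre>

<p>The first interval passes, with gaps of exactly 9 minutes; the second fails because its first two checkpoints are only 4 minutes apart.</p>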


<p>With an update benchmark, many more things need to be defined than with a read-only workload, and the set-up becomes more system-specific. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1de61db8">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1f9f96f8">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1f89eeb0">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1ad83f30">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1de62178">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1b2ec018">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1ae6f028">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
Benchmarks, Redux (part 8): BSBM Explore and Update <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x132605c0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1a9871b0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1baa20f8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e25a840">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1b53db20">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e7ce520">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1b18f400">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>

</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1671">
  <rss:title>Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-07T23:39:22Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We will here analyze what the BSBM Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales. We will here look at database-running statistics for BSBM at different scales. Finally, we look at CPU profiles. But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have: SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC; KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 114105938 3302150 2 0 3171275 DB.DBA.RDF_QUAD RDF_QUAD 977426773 2041156 0 0 1970712 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 8250414 509239 6 15 491631 DB.DBA.RDF_QUAD RDF_QUAD_POGS 3677233812 183860 0 0 175386 DB.DBA.RDF_IRI RDF_IRI 32 99710 302151 5 95353 DB.DBA.RDF_QUAD RDF_QUAD_OP 30597 51593 168 0 48941 DB.DBA.RDF_QUAD RDF_QUAD_SP 265474 47210 17 0 46078 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 6020 212 3 0 212 DB.DBA.RDF_PREFIX RDF_PREFIX 0 167 16700 0 157 The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the PSOG index for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. 
The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison. Now let us reset the counts and see what the steady state I/O profile is. SELECT key_stat (key_table, name_part (key_name, 2), &#39;reset&#39;) FROM sys_keys WHERE key_migrate_to IS NULL; SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC; KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 30155789 79659 0 0 3191391 DB.DBA.RDF_QUAD RDF_QUAD 259008064 8904 0 0 1948707 DB.DBA.RDF_QUAD RDF_QUAD_SP 68002 7730 11 0 53360 DB.DBA.RDF_IRI RDF_IRI 12 5415 41653 6 98804 DB.DBA.RDF_QUAD RDF_QUAD_POGS 975147136 1597 0 0 173459 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 2213525 1286 0 17 485093 DB.DBA.RDF_QUAD RDF_QUAD_OP 7999 904 11 0 48568 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1494 1 0 0 213 Literal strings dominate. The SP index is used only for situations where the P is not specified, i.e., the DESCRIBE query. Based on this, I/O seems to be attributable mostly to this. The first RDF_IRI represents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the first RDF_IRI is not properly recorded, hence the miss % is out of line. We see SP missing the cache the most since its use is infrequent in the mix. We will next look at query processing statistics. For this we introduce a new meter. The db_activity SQL function provides a session-by-session cumulative statistic of activity. The fields are: rnd - Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found. seq - Count of sequential rows. Every move to next row on a cursor counts as 1, regardless of whether conditions match. 
same seg - For column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection. same pg - Counts how many times a vectored index join finds the next match on the same page as the previous one. same par - Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the same parent. disk - Counts how many disk reads were made, including any speculative reads initiated. spec disk - Counts speculative disk reads. messages - Counts cluster interconnect messages B (KB, MB, GB) - is the total length of the cluster interconnect messages. fork - Counts how many times a thread was forked (started) for query parallelization. The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000), K is 10^3 (1,000). We run 2000 query mixes with 16 Users. The special http account keeps a cumulative account of all activity on web server threads. SELECT db_activity (2, &#39;http&#39;); 1.674G rnd  3.223G seq  0 same seg  1.286G same pg  314.8M same par  6.186M disk  6.461M spec disk  0B / 0 messages  298.6K fork We see that random access dominates. The seq number is about twice the rnd number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, the same seg is 0; the same pg indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one. There are more speculative reads than disk reads which is an artifact of counting some concurrently speculated reads twice. This does indicate that speculative reads dominate. 
This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes. Now let us look at the same reading after 2000 mixes, 16 Users at 100 Mt. 234.3M rnd  420.5M seq  0 same seg  188.8M same pg  29.09M same par  808.9K disk  919.9K spec disk  0B / 0 messages  76K fork We note that the ratios between the random and sequential and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100 Mt run is 14% of the count for the 1000 Mt run. The count of query parallelization threads is also much lower since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that according to the cost model guess, the thread must have at least 5ms worth of work. We note that the 100 Mt throughput is a little over three times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven times faster instead, for this much less work. We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there. To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 Single User mixes at both scales. 
For 100 Mt: 61161 10.1723 cmpf_iri64n_iri64n_anyn_gt_lt 31321 5.2093 box_equal 19027 3.1646 sqlo_parse_tree_has_node 15905 2.6453 dk_alloc 15647 2.6024 itc_next_set_neq 12702 2.1126 itc_vec_split_search 12487 2.0768 itc_dive_transit 11450 1.9044 itc_bm_vec_row_check 10646 1.7706 itc_page_rcf_search 9223 1.5340 id_hash_get 9215 1.5326 gen_qsort 8867 1.4748 sqlo_key_part_best 8807 1.4648 itc_param_cmp 8062 1.3409 cmpf_iri64n_iri64n 6820 1.1343 sqlo_in_list 6005 0.9987 dc_iri_id_cmp 5905 0.9821 dk_free_tree 5801 0.9648 box_hash 5509 0.9163 dks_esc_write 5444 0.9054 sql_tree_hash_1 For 1000 Mt 754331 31.4149 cmpf_iri64n_iri64n_anyn_gt_lt 146165 6.0872 itc_vec_split_search 144795 6.0301 itc_next_set_neq 131671 5.4836 itc_dive_transit 110870 4.6173 itc_page_rcf_search 66780 2.7811 gen_qsort 66434 2.7667 itc_param_cmp 58450 2.4342 itc_bm_vec_row_check 55213 2.2994 dk_alloc 47793 1.9904 cmpf_iri64n_iri64n 44277 1.8440 dc_iri_id_cmp 39489 1.6446 cmpf_int64n 36880 1.5359 dc_append_bytes 36601 1.5243 dv_compare 31286 1.3029 dc_any_value_prefetch 25457 1.0602 itc_next_set 20852 0.8684 box_equal 19895 0.8285 dk_free_tree 19698 0.8203 itc_page_insert_search 19367 0.8066 dc_copy The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query optimization is about 6.5 times greater. The top function in this category is box_equal with 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile. From this sample it appears ten times more scale is seven times more database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. 
From these numbers, we could say that Virtuoso is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt. We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales. This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix. Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? (this post) Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We will here analyze what the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1d1a4e30">BSBM</a> Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales.</p>

<p>We will here look at database-running statistics for BSBM at different scales. Finally, we look at <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c1df888">CPU</a> profiles.</p>


<p>But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:</p>

<p>
<code>SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;</code>
</p>

<blockquote>
 <code><pre>
KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.<a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1c585f20">RDF</a>_OBJ     RDF_OBJ                        114105938  3302150         2        0    3171275
DB.DBA.RDF_QUAD    RDF_QUAD                       977426773  2041156         0        0    1970712
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        8250414   509239         6       15     491631
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                 3677233812   183860         0        0     175386
DB.DBA.RDF_IRI     RDF_IRI                               32    99710    302151        5      95353
DB.DBA.RDF_QUAD    RDF_QUAD_OP                        30597    51593       168        0      48941
DB.DBA.RDF_QUAD    RDF_QUAD_SP                       265474    47210        17        0      46078
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        6020      212         3        0        212
DB.DBA.RDF_PREFIX  RDF_PREFIX                             0      167     16700        0        157
</pre>
 </code>
</blockquote>


<p>The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1ac97430">data</a> structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the <code>PSOG</code> index for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.</p>

<p>Now let us reset the counts and see what the steady state I/O profile is.</p>

<p>
<code>SELECT key_stat (key_table, name_part (key_name, 2), &#39;reset&#39;) FROM sys_keys WHERE key_migrate_to IS NULL;</code>
</p>
<p>
<code>SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;</code>
</p>

<blockquote>
 <code><pre>
KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                         30155789    79659         0        0    3191391
DB.DBA.RDF_QUAD    RDF_QUAD                       259008064     8904         0        0    1948707
DB.DBA.RDF_QUAD    RDF_QUAD_SP                        68002     7730        11        0      53360
DB.DBA.RDF_IRI     RDF_IRI                               12     5415     41653        6      98804
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                  975147136     1597         0        0     173459
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        2213525     1286         0       17     485093
DB.DBA.RDF_QUAD    RDF_QUAD_OP                         7999      904        11        0      48568
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        1494        1         0        0        213
</pre>
 </code>
</blockquote>



<p>Literal strings dominate. The <code>SP</code> index is used only where the <code>P</code> is not specified, i.e., in the <code>DESCRIBE</code> query. Based on this, the I/O seems to be attributable mostly to the <code>DESCRIBE</code> query. The first <code>RDF_IRI</code> represents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the first <code>RDF_IRI</code> is not properly recorded, hence its miss % is out of line. We see <code>SP</code> missing the <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x11b80700">cache</a> the most, since its use is infrequent in the mix.</p>


<p>We will next look at query processing statistics. For this we introduce a new meter.</p>

<p>The <code>db_activity</code> <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x1ddd4bf0">SQL</a> function provides a session-by-session cumulative statistic of activity. The fields are: </p>

<ul>
<li>
  <b><code>rnd</code>
  </b> - Count of <i>random index lookups</i>. Each first row of a select or insert counts as one, regardless of whether something was found.</li>
<li>
  <b><code>seq</code>
  </b> - Count of <i>sequential rows</i>. Every move to next row on a cursor counts as 1, regardless of whether conditions match.</li>
<li>
  <b><code>same seg</code>
  </b> - For column store only; counts how many times the next row in a vectored join using an index falls in the <i>same segment</i> as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection.</li>
<li>
  <b><code>same pg</code>
  </b> - Counts how many times a vectored index join finds the next match on the <i>same page</i> as the previous one.</li>
<li>
  <b><code>same par</code>
  </b> - Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the <i>same parent</i>.</li>
<li>
  <b><code>disk</code>
  </b> - Counts how many <i>disk reads</i> were made, including any speculative reads initiated.</li>
<li>
  <b><code>spec disk</code>
  </b> - Counts <i>speculative disk reads</i>.</li>
<li>
  <b><code>messages</code>
  </b> - Counts <i>cluster interconnect messages</i>.</li>
<li>
  <b><code>B (KB, MB, GB)</code>
  </b> - is the <i>total length</i> of the cluster interconnect messages.</li>
<li>
  <b><code>fork</code>
  </b> - Counts how many times a <i>thread was forked (started)</i> for query parallelization.</li>
</ul>

<p>The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000), K is 10^3 (1,000).</p>

<p>We run 2000 query mixes with 16 users. The special <code><a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1ebdef18">http</a></code> account keeps a cumulative tally of all activity on web server threads.</p>

<blockquote>
<p>
  <code>SELECT db_activity (2, &#39;http&#39;);</code>
</p>
<p>
  <code>1.674G rnd  3.223G seq  0 same seg  1.286G same pg  314.8M same par  6.186M disk  6.461M spec disk  0B / 0 messages  298.6K fork</code>
</p>
</blockquote>
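<p>As an aside, a report line like the one above can be turned back into plain counts with a few lines of Python. This is a minimal sketch and not part of Virtuoso: the name <code>parse_db_activity</code> and the label list are our own, and the byte-length field (<code>0B</code>) is deliberately skipped.</p>

```python
import re

# Scale suffixes as described in the text: K = 10^3, M = 10^6, G = 10^9.
SCALE = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9}

# Two-word labels come first so that "spec disk" is not read as "disk".
LABELS = ["same seg", "same pg", "same par", "spec disk",
          "rnd", "seq", "disk", "messages", "fork"]

FIELD_RE = re.compile(r"([\d.]+)([KMG]?)\s+(" + "|".join(LABELS) + r")\b")


def parse_db_activity(report):
    """Return {label: count} for one db_activity report line (hypothetical helper)."""
    text = report.replace("\u00a0", " ")  # the report may be padded with non-breaking spaces
    return {
        label: round(float(number) * SCALE[suffix])
        for number, suffix, label in FIELD_RE.findall(text)
    }


line = ("1.674G rnd 3.223G seq 0 same seg 1.286G same pg 314.8M same par "
        "6.186M disk 6.461M spec disk 0B / 0 messages 298.6K fork")
counts = parse_db_activity(line)
print(counts["rnd"], counts["same pg"])
```

<p>Since the report carries only 4 significant digits, the parsed counts are rounded values, good for ratios but not for exact accounting.</p>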

<p>We see that random access dominates. The <code>seq</code> number is about twice the <code>rnd</code> number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, the <code>same seg</code> is 0; the <code>same pg</code> indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.</p>
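<p>The ratios quoted here follow directly from the report. A small worked check, with the counts written out in full; the variable names are ours, and the same-parent share is taken relative to the lookups that did not land on the same page, which is how we read the definition of <code>same par</code>:</p>

```python
# Counts from the 1000 Mt report quoted above, written out in full.
rnd = 1_674_000_000
seq = 3_223_000_000
same_pg = 1_286_000_000
same_par = 314_800_000

rows_per_lookup = seq / rnd               # sequential rows per random lookup
same_page_share = same_pg / rnd           # lookups landing on the previous page
other_lookups = rnd - same_pg             # lookups that had to change page
same_parent_share = same_par / other_lookups

print(f"rows per random lookup:        {rows_per_lookup:.2f}")
print(f"same-page share:               {same_page_share:.0%}")
print(f"same-parent share of the rest: {same_parent_share:.0%}")
```

<p>This comes to about 1.93 rows per lookup, a 77% same-page share, and 81% of the remaining lookups under the same parent.</p>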

<p>There are more speculative reads than disk reads, which is an artifact of counting some concurrently speculated reads twice. Even so, the numbers indicate that speculative reads dominate. This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes.</p>

<p>Now let us look at the same reading after 2000 mixes with 16 users at 100 Mt.</p>

<blockquote>
<p>
  <code>234.3M rnd  420.5M seq  0 same seg  188.8M same pg  29.09M same par  808.9K disk  919.9K spec disk  0B / 0 messages  76K fork</code>
</p>
</blockquote>


<p>We note that the ratios between the random, sequential, and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100 Mt run is 14% of the count for the 1000 Mt run. The count of query parallelization threads is also much lower, since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that, according to the cost model's estimate, the thread must have at least 5 ms worth of work.</p>

<p>We note that the 100 Mt throughput is a little over three times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven times faster, given that it does about one seventh of the work.</p>

<p>We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor <i>per se</i>. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there.</p>
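<p>The comparison counts above are simply the base-2 logarithm of the row count, assuming a plain binary search over the keys. A quick check, with a helper name of our own choosing:</p>

```python
import math


def comparisons_for(rows):
    """Approximate key comparisons for one binary-search lookup among `rows` keys."""
    return math.log2(rows)


for rows in (100_000_000, 1_000_000_000):
    print(f"{rows:,} rows: about {comparisons_for(rows):.1f} comparisons")
```

<p>Ten times more rows adds only a little over three comparisons per lookup, which is why tree depth alone cannot explain the difference between the scales.</p>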

<p>To elucidate this last question, we look at the CPU profiles. We take an <a href="http://oprofile.sourceforge.net/about/" id="link-id0x1efb3360">oprofile</a> of 100 Single User mixes at both scales.</p>

<p>For 100 Mt:</p>

<blockquote>
 <code><pre>
61161    10.1723  cmpf_iri64n_iri64n_anyn_gt_lt
31321     5.2093  box_equal
19027     3.1646  sqlo_parse_tree_has_node
15905     2.6453  dk_alloc
15647     2.6024  itc_next_set_neq
12702     2.1126  itc_vec_split_search
12487     2.0768  itc_dive_transit
11450     1.9044  itc_bm_vec_row_check
10646     1.7706  itc_page_rcf_search
 9223     1.5340  id_hash_get
 9215     1.5326  gen_qsort
 8867     1.4748  sqlo_key_part_best
 8807     1.4648  itc_param_cmp
 8062     1.3409  cmpf_iri64n_iri64n
 6820     1.1343  sqlo_in_list
 6005     0.9987  dc_iri_id_cmp
 5905     0.9821  dk_free_tree
 5801     0.9648  box_hash
 5509     0.9163  dks_esc_write
 5444     0.9054  sql_tree_hash_1
</pre>
 </code>
</blockquote>


<p>For 1000 Mt:</p>

<blockquote>
 <code><pre>
754331   31.4149  cmpf_iri64n_iri64n_anyn_gt_lt
146165    6.0872  itc_vec_split_search
144795    6.0301  itc_next_set_neq
131671    5.4836  itc_dive_transit
110870    4.6173  itc_page_rcf_search
 66780    2.7811  gen_qsort
 66434    2.7667  itc_param_cmp
 58450    2.4342  itc_bm_vec_row_check
 55213    2.2994  dk_alloc
 47793    1.9904  cmpf_iri64n_iri64n
 44277    1.8440  dc_iri_id_cmp
 39489    1.6446  cmpf_int64n
 36880    1.5359  dc_append_bytes
 36601    1.5243  dv_compare
 31286    1.3029  dc_any_value_prefetch
 25457    1.0602  itc_next_set
 20852    0.8684  box_equal
 19895    0.8285  dk_free_tree
 19698    0.8203  itc_page_insert_search
 19367    0.8066  dc_copy
</pre>
 </code>
</blockquote>


<p>The top function in both profiles is the comparison for equality on two leading IRIs plus a range on the trailing any. This corresponds to the range check in Q5. At the larger scale, its share of the profile is three times greater. At the smaller scale, the share of query <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1ea77ab8">optimization</a> is about 6.5 times greater. The top function in this category is <code>box_equal</code>, with 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.</p>
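<p>The two ratios that can be read straight off the listings are easy to check. The sketch below uses only the percentage column of the two profiles; the dictionary and function names are ours:</p>

```python
# Percent-of-samples figures copied from the two oprofile listings above.
share_100mt = {"cmpf_iri64n_iri64n_anyn_gt_lt": 10.1723, "box_equal": 5.2093}
share_1000mt = {"cmpf_iri64n_iri64n_anyn_gt_lt": 31.4149, "box_equal": 0.8684}


def ratio(name, numerator, denominator):
    """Ratio of one function's profile share between two runs."""
    return numerator[name] / denominator[name]


# How much larger the range compare looms at 1000 Mt than at 100 Mt.
range_compare_growth = ratio("cmpf_iri64n_iri64n_anyn_gt_lt", share_1000mt, share_100mt)
# How much larger box_equal looms at 100 Mt than at 1000 Mt.
box_equal_shrink = ratio("box_equal", share_100mt, share_1000mt)

print(f"range compare: {range_compare_growth:.2f}x, box_equal: {box_equal_shrink:.2f}x")
```

<p>Note that these are shares of each run's own samples, so they compare the mix of work at each scale, not absolute CPU time.</p>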

<p>From this sample, it appears that ten times the scale means seven times the database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. From these numbers, we could say that <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1c819148">Virtuoso</a> is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.</p>
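<p>The "about two times" figure can be reconstructed from the counters. This is a rough sketch under one stated assumption: the throughput ratio, described earlier only as "a little over three times", is taken as exactly 3.0.</p>

```python
rnd_1000mt = 1_674_000_000   # random lookups over 2000 mixes at 1000 Mt
rnd_100mt = 234_300_000      # random lookups over 2000 mixes at 100 Mt

# Database operations per mix grow by this factor from 100 Mt to 1000 Mt.
work_ratio = rnd_1000mt / rnd_100mt

# Assumption: the 100 Mt run completes mixes 3.0 times faster, so a mix
# takes 3.0 times longer at 1000 Mt.
time_ratio = 3.0

# Operations completed per unit of time, 1000 Mt relative to 100 Mt.
ops_per_time_gain = work_ratio / time_ratio

print(f"work per mix: {work_ratio:.1f}x, operations per unit time: {ops_per_time_gain:.1f}x")
```

<p>With these inputs the gain lands a bit above two; a throughput ratio somewhat over three would bring it closer to two.</p>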



<p>We may conclude that different BSBM scales measure different things. The <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1f0ffb40">TPC</a> workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.</p>


<p>This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix.</p>

<p>Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks. </p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li> <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1a9bcff8">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d3e5470">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1de94770">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1ea66470">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1f1118d8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1d1c0cd8">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
 Benchmarks, Redux (part 7): What Does BSBM Explore Measure? <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1aaf4180">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1a957610">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x127e75c8">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1c9400f0">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d2c1d68">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ea1fb40">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c073a10">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c5541e8">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1669">
  <rss:title>Benchmarks, Redux (part 6): BSBM and I/O, continued</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-07T22:36:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. For this reason, the SSD extent read ahead was incomparably better. We note that in the experiment, every page in the general area of the database the experiment touched would in time be touched, and that the whole working set would end up in memory. Therefore no speculative read would be wasted. Therefore it stands to reason to read whole extents. So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read ahead would require more temporal locality before kicking in. Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an &quot;elevator only&quot; scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty. We keep in mind that the test we target is BSBM warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput. We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run. We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread&#39;s random I/Os get to go to multiple threads. The benefit is less in multiuser situations where disks are randomly busy anyhow. 
In the previous I/O experiments, we saw that with vectored read ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset. In Virtuoso 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the cache quite fast, but was useless after the cache was full. The extent read ahead first implemented in 6 was less aggressive, but would continue working with full cache and did in fact help with shifts in the working set. The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch. With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% CPU. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages. The BSBM workload does not offer better possibilities for optimization, short of pre-reading the whole database, which is not practical at large scales. Some Details First we start from cold disk, with and without mandatory read of the whole extent on the touch. 
Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes: 0: 151560.82 ms, total: 151718 ms 1: 179589.08 ms, total: 179648 ms 2: 71974.49 ms, total: 72017 ms 3: 102701.73 ms, total: 102729 ms 4: 58834.41 ms, total: 58856 ms 5: 65926.34 ms, total: 65944 ms 6: 68244.69 ms, total: 68274 ms 7: 39197.15 ms, total: 39215 ms 8: 45654.93 ms, total: 45674 ms 9: 34850.30 ms, total: 34878 ms 10: 100061.30 ms, total: 100079 ms The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read was 16 ms. With vectored read-ahead and full extents only, i.e., max speculation: 0: 178854.23 ms, total: 179034 ms 1: 110826.68 ms, total: 110887 ms 2: 19896.11 ms, total: 19941 ms 3: 36724.43 ms, total: 36753 ms 4: 21253.70 ms, total: 21285 ms 5: 18417.73 ms, total: 18439 ms 6: 21668.92 ms, total: 21690 ms 7: 12236.49 ms, total: 12267 ms 8: 14922.74 ms, total: 14945 ms 9: 11502.96 ms, total: 11523 ms 10: 15762.34 ms, total: 15792 ms ... 90: 1747.62 ms, total: 1761 ms 91: 1701.01 ms, total: 1714 ms 92: 1300.62 ms, total: 1318 ms 93: 1873.15 ms, total: 1886 ms 94: 1508.24 ms, total: 1524 ms 95: 1748.15 ms, total: 1761 ms 96: 2076.92 ms, total: 2090 ms 97: 2199.38 ms, total: 2212 ms 98: 2305.75 ms, total: 2319 ms 99: 1771.91 ms, total: 1784 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 1.3006s / 178.8542s Elapsed runtime: 872.993 seconds QMpH: 412.374 query mixes per hour The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%. We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better. Then the same with cold SSDs. 
First with no speculation: 0: 5177.68 ms, total: 5302 ms 1: 2570.16 ms, total: 2614 ms 2: 1353.06 ms, total: 1391 ms 3: 1957.63 ms, total: 1978 ms 4: 1371.13 ms, total: 1386 ms 5: 1765.55 ms, total: 1781 ms 6: 1658.23 ms, total: 1673 ms 7: 1273.87 ms, total: 1289 ms 8: 1355.19 ms, total: 1380 ms 9: 1152.78 ms, total: 1167 ms 10: 1787.91 ms, total: 1802 ms ... 90: 1116.25 ms, total: 1128 ms 91: 989.50 ms, total: 1001 ms 92: 833.24 ms, total: 844 ms 93: 1137.83 ms, total: 1150 ms 94: 969.47 ms, total: 982 ms 95: 1138.04 ms, total: 1149 ms 96: 1155.98 ms, total: 1168 ms 97: 1178.15 ms, total: 1193 ms 98: 1120.18 ms, total: 1132 ms 99: 1013.16 ms, total: 1025 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.8201s / 5.1777s Elapsed runtime: 127.555 seconds QMpH: 2822.321 query mixes per hour The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%. Now, SSDs with max speculation. 0: 44670.34 ms, total: 44809 ms 1: 18490.44 ms, total: 18548 ms 2: 7306.12 ms, total: 7353 ms 3: 9452.66 ms, total: 9485 ms 4: 5648.56 ms, total: 5668 ms 5: 5493.21 ms, total: 5511 ms 6: 5951.48 ms, total: 5970 ms 7: 3815.59 ms, total: 3834 ms 8: 4560.71 ms, total: 4579 ms 9: 3523.74 ms, total: 3543 ms 10: 4724.04 ms, total: 4741 ms ... 90: 673.53 ms, total: 685 ms 91: 534.62 ms, total: 545 ms 92: 730.81 ms, total: 742 ms 93: 1358.14 ms, total: 1370 ms 94: 1098.64 ms, total: 1110 ms 95: 1232.20 ms, total: 1243 ms 96: 1259.57 ms, total: 1273 ms 97: 1298.95 ms, total: 1310 ms 98: 1156.01 ms, total: 1166 ms 99: 1025.45 ms, total: 1034 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.4725s / 44.6703s Elapsed runtime: 269.323 seconds QMpH: 1336.683 query mixes per hour The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%. 
The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merging reads with small differences. The max IO was 353 MB/s, and average 173 MB/s; average CPU 113%. We see that the start latency is quite a bit longer than without speculation and the CPU % is lower due to higher latency of individual I/O. The I/O rate is fair. We would expect more throughput however. We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower. We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with cat file /dev/null and two drives busy. With 4 drives busy, this does not get better. The best 30 second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with the cat to /dev/null figure. We will later test with 8 SSDs with better controllers. Note that the average I/O and CPU are averages over 30 second measurement windows; thus for short running tests, there is some error from the window during which the activity ended. Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM Explore in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. 
However, the vectored read-ahead does not run either, because the data that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving RDF literal strings, presumably on behalf of the DESCRIBE query in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized. The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream-compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in 72 GB RAM except for the random literals accessed by the DESCRIBE. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect. Conclusions We have looked at basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs. More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns. As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store just must eagerly pre-read. This is not hard to do and can be quite useful. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued (this post) Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. For this reason, the SSD extent read ahead was incomparably better. We note that in the experiment, every page in the general area of the database the experiment touched would in time be touched, and that the whole working set would end up in memory. Therefore no speculative read would be wasted, and it stands to reason to read whole extents.</p>

<p>So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read ahead would require more temporal locality before kicking in. </p>

<p>Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an &quot;elevator only&quot; scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty.</p>


<p>We keep in mind that the test we target is <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1beac870">BSBM</a> warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput.</p>

<p>We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run.</p>

<p>We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread&#39;s random I/Os get to go to multiple threads. The benefit is less in multiuser situations where disks are randomly busy anyhow. </p>

<p>In the previous I/O experiments, we saw that with vectored read ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset. </p>
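<p>To illustrate why a longer queue helps, here is a minimal sketch of elevator-style ordering: pending page reads are sorted by file offset and neighbours are merged into runs. This is not Virtuoso's code; the function name and the <code>max_gap</code> parameter are hypothetical, and pages are identified by their page number within the file.</p>

```python
def elevator_runs(pending_pages, max_gap=0):
    """Sort pending page reads by offset and merge near neighbours into runs.

    Returns a list of (first_page, page_count) tuples. With max_gap=0 only
    strictly adjacent pages merge; a larger max_gap also reads across small
    holes, trading some wasted transfer for fewer separate requests.
    """
    runs = []  # each run is [first_page, page_count]
    for page in sorted(set(pending_pages)):
        if runs:
            first, count = runs[-1]
            gap = page - (first + count)  # pages skipped since the end of the last run
            if max_gap >= gap:
                runs[-1][1] = page - first + 1  # extend the run, reading across the gap
                continue
        runs.append([page, 1])
    return [tuple(run) for run in runs]


print(elevator_runs([7, 3, 4, 100, 5, 101]))
print(elevator_runs([7, 3, 4, 100, 5, 101], max_gap=1))
```

<p>With only a handful of pages queued, most runs have length one; with thousands queued, the sorted order alone cuts seek distance and adjacent pages become likely.</p>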

<p>In <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x117fd820">Virtuoso</a> 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1de04030">cache</a> quite fast, but was useless after the cache was full. The extent read ahead first implemented in 6 was less aggressive, but would continue working with full cache and did in fact help with shifts in the working set.</p>

<p>The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch.</p>
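<p>The combined policy can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the extent size of 256 pages, the function name <code>plan_read_ahead</code>, and the use of plain sets and a dict for the cache, the allocation map, and the visit times are all ours.</p>

```python
import time

EXTENT_PAGES = 256  # assumption: pages per extent; the post does not state the size


def plan_read_ahead(wanted_pages, cached, allocated, last_visit, window_s, now=None):
    """Build one read batch from the pages a vectored operation is about to need.

    wanted_pages: page numbers the vectored lookup will touch
    cached:       set of page numbers already in the buffer pool
    allocated:    set of page numbers that are allocated in the database
    last_visit:   dict extent number -> time of the previous visit (updated here)
    window_s:     how recent the previous visit must be to pull in the whole extent
    """
    now = time.monotonic() if now is None else now
    batch = {page for page in wanted_pages if page not in cached}
    for extent in {page // EXTENT_PAGES for page in wanted_pages}:
        seen = last_visit.get(extent)
        if seen is not None and window_s >= now - seen:
            # Visited recently: add every uncached allocated page of the extent.
            start = extent * EXTENT_PAGES
            batch.update(
                page for page in range(start, start + EXTENT_PAGES)
                if page in allocated and page not in cached
            )
        last_visit[extent] = now
    return sorted(batch)  # sorted so the batch can be issued in file order
```

<p>A long <code>window_s</code> while the buffer pool is filling, and a short one afterwards, would correspond to the two regimes described earlier in this post.</p>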

<p>With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x180fa748">CPU</a>. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages.</p>

<p>The BSBM workload does not offer better possibilities for <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1e0f7078">optimization</a>, short of pre-reading the whole database, which is not practical at large scales. </p>

<h2>Some Details</h2>

<p>First we start from cold disk, with and without mandatory read of the whole extent on the touch.</p>

<p>Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes:</p>

<blockquote>
 <code><pre>
 0: 151560.82 ms, total: 151718 ms
 1: 179589.08 ms, total: 179648 ms
 2:  71974.49 ms, total:  72017 ms
 3: 102701.73 ms, total: 102729 ms
 4:  58834.41 ms, total:  58856 ms
 5:  65926.34 ms, total:  65944 ms
 6:  68244.69 ms, total:  68274 ms
 7:  39197.15 ms, total:  39215 ms
 8:  45654.93 ms, total:  45674 ms
 9:  34850.30 ms, total:  34878 ms
10: 100061.30 ms, total: 100079 ms
</pre>
 </code>
</blockquote>

<p>The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read was 16 ms. </p>

<p>With vectored read-ahead and full extents only, i.e., max speculation:</p>

<blockquote>
 <code><pre>
 0: 178854.23 ms, total: 179034 ms
 1: 110826.68 ms, total: 110887 ms
 2:  19896.11 ms, total:  19941 ms
 3:  36724.43 ms, total:  36753 ms
 4:  21253.70 ms, total:  21285 ms
 5:  18417.73 ms, total:  18439 ms
 6:  21668.92 ms, total:  21690 ms
 7:  12236.49 ms, total:  12267 ms
 8:  14922.74 ms, total:  14945 ms
 9:  11502.96 ms, total:  11523 ms
10:  15762.34 ms, total:  15792 ms
...

90:   1747.62 ms, total:   1761 ms
91:   1701.01 ms, total:   1714 ms
92:   1300.62 ms, total:   1318 ms
93:   1873.15 ms, total:   1886 ms
94:   1508.24 ms, total:   1524 ms
95:   1748.15 ms, total:   1761 ms
96:   2076.92 ms, total:   2090 ms
97:   2199.38 ms, total:   2212 ms
98:   2305.75 ms, total:   2319 ms
99:   1771.91 ms, total:   1784 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  1.3006s / 178.8542s
Elapsed runtime:           872.993 seconds
QMpH:                      412.374 query mixes per hour
</pre>
 </code>
</blockquote>
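<p>For reference, the QMpH line in the driver summary is just the mix count scaled to an hour of elapsed time. A one-line check against the run above; the function name is ours:</p>

```python
def qmph(query_mixes, elapsed_seconds):
    """Query mixes per hour from a mix count and an elapsed wall-clock time."""
    return query_mixes * 3600 / elapsed_seconds


print(f"{qmph(100, 872.993):.3f} query mixes per hour")
```

<p>Because the figure is an average over the whole run, the two slow opening mixes weigh heavily on it; the last ten mixes alone run at a far higher rate.</p>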


<p>The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%.</p>

<p>We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better. </p>



<p>Then the same with cold SSDs. First with no speculation:</p>

<blockquote>
 <code><pre>
 0:   5177.68 ms, total:   5302 ms
 1:   2570.16 ms, total:   2614 ms
 2:   1353.06 ms, total:   1391 ms
 3:   1957.63 ms, total:   1978 ms
 4:   1371.13 ms, total:   1386 ms
 5:   1765.55 ms, total:   1781 ms
 6:   1658.23 ms, total:   1673 ms
 7:   1273.87 ms, total:   1289 ms
 8:   1355.19 ms, total:   1380 ms
 9:   1152.78 ms, total:   1167 ms
10:   1787.91 ms, total:   1802 ms
...

90:   1116.25 ms, total:   1128 ms
91:    989.50 ms, total:   1001 ms
92:    833.24 ms, total:    844 ms
93:   1137.83 ms, total:   1150 ms
94:    969.47 ms, total:    982 ms
95:   1138.04 ms, total:   1149 ms
96:   1155.98 ms, total:   1168 ms
97:   1178.15 ms, total:   1193 ms
98:   1120.18 ms, total:   1132 ms
99:   1013.16 ms, total:   1025 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.8201s / 5.1777s
Elapsed runtime:           127.555 seconds
QMpH:                      2822.321 query mixes per hour
</pre>
 </code>
</blockquote>


<p>The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%.</p>

<p>Now, SSDs with max speculation.</p>

<blockquote>
 <code><pre>
 0:  44670.34 ms, total:  44809 ms
 1:  18490.44 ms, total:  18548 ms
 2:   7306.12 ms, total:   7353 ms
 3:   9452.66 ms, total:   9485 ms
 4:   5648.56 ms, total:   5668 ms
 5:   5493.21 ms, total:   5511 ms
 6:   5951.48 ms, total:   5970 ms
 7:   3815.59 ms, total:   3834 ms
 8:   4560.71 ms, total:   4579 ms
 9:   3523.74 ms, total:   3543 ms
10:   4724.04 ms, total:   4741 ms
...

90:    673.53 ms, total:    685 ms
91:    534.62 ms, total:    545 ms
92:    730.81 ms, total:    742 ms
93:   1358.14 ms, total:   1370 ms
94:   1098.64 ms, total:   1110 ms
95:   1232.20 ms, total:   1243 ms
96:   1259.57 ms, total:   1273 ms
97:   1298.95 ms, total:   1310 ms
98:   1156.01 ms, total:   1166 ms
99:   1025.45 ms, total:   1034 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.4725s / 44.6703s
Elapsed runtime:           269.323 seconds
QMpH:                      1336.683 query mixes per hour
</pre>
 </code>
</blockquote>


<p>The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%.</p>

<p>The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merging reads with small differences. The max IO was 353 MB/s, and average 173 MB/s; average CPU 113%.</p>

<p>We see that the start latency is quite a bit longer than without speculation, and the CPU % is lower due to the higher latency of individual I/Os. The I/O rate is fair. We would expect more throughput, however. </p>

<p>We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower.</p>



<p>We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os per second promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with <code>cat file &gt; /dev/null</code> and two drives busy. With 4 drives busy, this does not get better. The best 30 second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with the <code>cat</code> to <code>/dev/null</code> figure. We will later test with 8 SSDs with better controllers.</p>

<p>Note that the average I/O and CPU are averages over 30 second measurement windows; thus for short running tests, there is some error from the window during which the activity ended. </p>


<p>Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM <i>Explore</i> in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. However, the vectored read-ahead does not run either, because the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1d7674e8">data</a> that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1c4adda0">RDF</a> literal strings, presumably on behalf of the <code>DESCRIBE</code> query in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized.</p>
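<p>The cache-fill figures above can be cross-checked against each other. The sketch below assumes that each buffer holds one 8K page (the page size given in part 5 of this series) and that MB means 10^6 bytes; both are assumptions, and the variable names are only illustrative.</p>

```python
# Cross-check: data volume implied by the buffer count versus the volume
# implied by the fill time and the average read rate.
BUFFER_BYTES = 8 * 1024      # assumption: one buffer = one 8K page
buffers_read = 7_500_000     # from the text
fill_seconds = 17 * 60       # from the text: 17 minutes
average_mb_per_s = 58        # from the text

bytes_by_buffers = buffers_read * BUFFER_BYTES
bytes_by_rate = fill_seconds * average_mb_per_s * 1_000_000  # MB taken as 10^6 bytes

print(f"by buffer count: {bytes_by_buffers / 1e9:.1f} GB")
print(f"by average rate: {bytes_by_rate / 1e9:.1f} GB")
```

<p>Under these assumptions the two estimates land within a few percent of each other, which is about as close as 30 second measurement windows allow.</p>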

<p>The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream-compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in 72 GB RAM except for the random literals accessed by the <code>DESCRIBE</code>. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect.</p>

 
<h2>Conclusions</h2>


<p>We have looked at the basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs.</p>

<p>More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns.</p>

<p>As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store simply must pre-read eagerly. This is not hard to do and can be quite useful.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li> <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1b4342b0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d3e7388">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x153c7ba8">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1da11d98">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1d25d630">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs</a>
</li>
<li>
 Benchmarks, Redux (part 6): BSBM and I/O, continued <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1f1f5ee8">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1cd44938">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1d51f848">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x13d333c0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1e77a5e8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ea1fb40">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e7786c8">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f8a37f8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c69e018">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-07#1667">
  <rss:title>Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-07T19:17:36Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by BSBM. There are two approaches: run twice or otherwise make sure one runs from memory and forget about I/O, or make rules and metrics for warm-up. We will see if the second is possible with BSBM. From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference. In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays. Storage Arrays Type Quantity Maker Size Speed Interface speed Controller Drive Cache RAID SSD 4 Crucial 128 GB N/A 6Gbit SATA RocketRaid 640 128 MB None HDD 4 Samsung 1000 GB 7200 RPM 3Gbit SATA Intel ICH on Supermicro motherboard 16 MB None We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file &gt; /dev/null`. The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread. Two different read-ahead schemes are used: With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read. With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed. 
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM. There are a few different possibilities for the physical I/O: Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations. A thread finds it needs a page and reads it. Using Unix asynchronous I/O, aio.h, with the aio_* and lio_listio functions. Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency. The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads. There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H. While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure. 
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question. This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read. The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher. The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here. The database space allocation gives each index a number of 2MB segments, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range a the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store. For the sake of simplicity we only run 7 Single with the 1000 Mt scale. The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds. The next test was with HDDs and extent read-ahead. 
One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing. The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here, are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion. There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput. We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so. Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want. So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see. So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first. The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. 
With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput. Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more. BSBM Note We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum. Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead. A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. 
Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything. Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs (this post) Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of database benchmarks we cannot ignore I/O, as <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2a0452f8">BSBM</a> has pretty much done so far.</p>

<p>There are two approaches:</p> 

<ol>
<li>
  <p>run twice or otherwise make sure one runs from memory and forget about I/O, or</p>
</li>
<li>
  <p>make rules and metrics for warm-up.</p>
</li>
</ol>
<p>We will see if the second is possible with BSBM.</p>

<p>From this starting point, we look at various ways of scheduling I/O in <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x2a9fdb88">Virtuoso</a> using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference. </p>


<p>In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="9" align="center">Storage Arrays</th>
	</tr>
	<tr>
		<th align="center"> Type </th>
		<th align="center"> Quantity </th>
		<th align="center"> Maker </th>
		<th align="center"> Size </th>
		<th align="center"> Speed </th>
		<th align="center"> Interface speed </th>
		<th align="center"> Controller </th>
		<th align="center"> Drive <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x2ad20cd0">Cache</a> </th>
		<th align="center"> RAID </th>
	</tr>
	<tr>
		<td align="center"> SSD </td>
		<td align="center"> 4 </td>
		<td align="center"> Crucial </td>
		<td align="center"> 128 GB </td>
		<td align="center"> N/A </td>
		<td align="center"> 6Gbit SATA </td>
		<td align="center"> RocketRaid 640 </td>
		<td align="center"> 128 MB </td>
		<td align="center"> None </td>
	</tr>
	<tr>
		<td align="center"> HDD </td>
		<td align="center"> 4 </td>
		<td align="center"> Samsung </td>
		<td align="center"> 1000 GB </td>
		<td align="center"> 7200 RPM </td>
		<td align="center"> 3Gbit SATA </td>
		<td align="center"> <a class="auto-href" href="http://dbpedia.org/resource/Intel_Corporation" id="link-id0x2a8cfca0">Intel</a> ICH on Supermicro motherboard </td>
		<td align="center"> 16 MB </td>
		<td align="center"> None </td>
	</tr>
</table>


<p>We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with <code>cat file &gt; /dev/null</code>. </p>

<p>The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.</p>

<p>Two different read-ahead schemes are used: </p>
<ul>
 <li>
  <p>With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.</p>
 </li>
<li>
  <p>With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.</p>
</li>
</ul>
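<p>The vectored scheme can be illustrated with a toy index node. This is a minimal sketch, not Virtuoso code: it assumes a node holds sorted separator keys plus one more child pointer than separators, and that child <code>i</code> covers the keys between separator <code>i-1</code> (inclusive) and separator <code>i</code> (exclusive). The function name and the page numbers are hypothetical.</p>

```python
from bisect import bisect_right

def children_to_prefetch(separators, children, sorted_keys):
    """Return the child pages that a batch of sorted lookup keys will visit."""
    wanted = []
    for key in sorted_keys:
        # bisect_right gives the index of the child whose key range holds `key`
        child = children[bisect_right(separators, key)]
        # Keys arrive in ascending order, so repeats of a child are adjacent.
        if not wanted or wanted[-1] != child:
            wanted.append(child)
    return wanted

# A node with three separators and four children (page numbers are made up)
separators = [100, 200, 300]
children = [11, 12, 13, 14]
batch = sorted([250, 5, 17, 260, 299])
print(children_to_prefetch(separators, children, batch))
```

<p>Only the sub-trees that the batch actually touches are returned; here the children covering 100 to 199 and 300 upward are skipped, so nothing unneeded is scheduled.</p>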

<p>In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.</p>

<p>There are a few different possibilities for the physical I/O: </p>

<ol>
<li>
  <p>Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.</p>
</li>
<li>
  <p>A thread finds it needs a page and reads it.</p>
</li>
<li>
  <p>Using Unix asynchronous I/O, <code>aio.h</code>, with the <code>aio_*</code> and <code>lio_listio</code> functions.</p>
</li>
<li>
  <p>Using a single read system call for a run of adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.</p>
</li>
</ol>
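<p>The last option can be sketched as follows: one positioned read covers a whole run of adjacent pages, and only the pages that were asked for are kept. This is an illustration under assumptions, not the Virtuoso implementation: it uses Python's <code>os.pread</code> (available on Unix), 8K pages, and a hypothetical helper named <code>read_run</code>. A production version would also have to handle short reads.</p>

```python
import os
import tempfile

PAGE_SIZE = 8192  # 8K pages

def read_run(fd, first_page, last_page, wanted_pages):
    """Read pages first_page..last_page with one pread and keep the wanted ones."""
    length = (last_page - first_page + 1) * PAGE_SIZE
    # One system call for the whole run; pages in the gaps are read and dropped.
    buf = os.pread(fd, length, first_page * PAGE_SIZE)
    pages = {}
    for page in wanted_pages:
        start = (page - first_page) * PAGE_SIZE
        pages[page] = buf[start:start + PAGE_SIZE]
    return pages

# Demo on a scratch file in which every page is filled with its own page number
with tempfile.TemporaryFile() as f:
    for page in range(16):
        f.write(bytes([page]) * PAGE_SIZE)
    f.flush()
    result = read_run(f.fileno(), 3, 7, [3, 4, 7])

print(sorted(result))
```

<p>Pages 5 and 6 are transferred but discarded, which is the bandwidth-for-latency trade described in the list.</p>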

<p>The latter two apply only to bulk I/O that is scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.</p>
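<p>A minimal sketch of such an elevator queue follows. It is single-threaded for clarity and assumes that the sweep restarts from the beginning when it runs past the last queued page; the class and method names are hypothetical, and a real implementation would need locking around the queue.</p>

```python
from bisect import bisect_left

class Elevator:
    """Sorted queue of page numbers served in ascending sweeps."""

    def __init__(self):
        self.pending = []   # sorted page numbers waiting to be read
        self.position = 0   # page number where the sweep currently stands

    def submit(self, page):
        """Queue a page; may be called at any time, also in mid-sweep."""
        i = bisect_left(self.pending, page)
        if i == len(self.pending) or self.pending[i] != page:
            self.pending.insert(i, page)

    def next_page(self):
        """Return the next page in sweep order, or None when idle."""
        if not self.pending:
            return None
        i = bisect_left(self.pending, self.position)
        if i == len(self.pending):
            i = 0  # ran past the end of the queue: start over from the beginning
        page = self.pending.pop(i)
        self.position = page
        return page

elevator = Elevator()
for p in (40, 10, 30):
    elevator.submit(p)
order = [elevator.next_page(), elevator.next_page()]
elevator.submit(20)   # arrives behind the sweep, so it waits for the next pass
elevator.submit(35)   # arrives ahead of the sweep, so it is served on this pass
while (p := elevator.next_page()) is not None:
    order.append(p)
print(order)
```

<p>Work that arrives ahead of the current position is picked up on the same pass, while work that arrives behind it waits for the next one; this is the source of the throughput versus latency trade-off.</p>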

<p>There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the <a href="http://www.cwi.nl/" id="link-id0x1f62abb8">CWI collaborative scan paper</a>. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and <a class="auto-href" href="http://www.tpc.org/" id="link-id0x2a9067d0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x2a8874f0">H</a>.</p>

<p>While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.</p>

<p>The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x2997a660">CPU</a>. When running from memory, the CPU is around 1350% for the system in question. </p>

<p>This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read. </p>

<p>The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.</p>
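<p>The stopping rule can be written down as a small function. This sketch assumes that CPU utilization is sampled at known times and that a stretch counts as 20 seconds once the samples at or above the target span at least that long; the sampling interval and the function name are assumptions, not a description of the actual meter procedure.</p>

```python
def time_to_target(samples, target=1200.0, stretch=20.0):
    """samples: (seconds_since_start, cpu_percent) pairs in time order.

    Returns the time at which CPU has stayed at or above `target` for
    `stretch` seconds, or None if that never happens.
    """
    stretch_start = None
    for t, cpu in samples:
        if cpu >= target:
            if stretch_start is None:
                stretch_start = t
            if t - stretch_start >= stretch:
                return t
        else:
            stretch_start = None  # the stretch is broken; start counting again
    return None

# Made-up samples, 5 seconds apart
samples = [(0, 300), (5, 900), (10, 1250), (15, 1100), (20, 1210),
           (25, 1300), (30, 1220), (35, 1240), (40, 1260)]
reached = time_to_target(samples)
print(reached)
```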

<p>The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/ldmeter.sql" id="link-id0x1b5adb08">available here</a>. </p>


<p>The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing it out so as to have physical order match key order. We will look at some tricks in this vein with the column store.</p>
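<p>The allocation rule for page splits can be sketched like this. The data structures are hypothetical stand-ins (a dictionary of free pages per extent and a dictionary naming each extent's overflow extent); only the sizes and the order of preference come from the description above.</p>

```python
PAGES_PER_EXTENT = 256
PAGE_SIZE = 8 * 1024   # 256 pages of 8K make one 2MB extent

def extent_of(page_no):
    """Extent number that a page belongs to."""
    return page_no // PAGES_PER_EXTENT

def allocate_for_split(page_no, free_pages, overflow_of):
    """Pick a page for the new half of a split page.

    free_pages:  extent number -> set of free page numbers in that extent
    overflow_of: extent number -> its designated overflow extent
    Prefers the extent of the splitting page, then its overflow extent.
    """
    home = extent_of(page_no)
    for extent in (home, overflow_of.get(home)):
        candidates = free_pages.get(extent)
        if candidates:
            new_page = min(candidates)
            candidates.remove(new_page)
            return new_page
    return None  # neither extent has room; a real allocator would go further

free_pages = {0: {200}, 5: {1280, 1281}}
overflow_of = {0: 5}
first = allocate_for_split(17, free_pages, overflow_of)    # room in extent 0
second = allocate_for_split(17, free_pages, overflow_of)   # extent 0 now full
print(first, extent_of(first), second, extent_of(second))
```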

<p>For the sake of simplicity we only run 7 Single with the 1000 Mt scale.</p>


<p>The first experiment was with SSDs and the vectored read-ahead.  The target throughput was reached after 280 seconds. </p>

<p>The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.</p>

<p>The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.</p>

<p>There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 40 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.</p>

<p>We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so. </p>

<p>Therefore, if we have a sequential read pattern that is denser than 1 page out of 10, we now read all the pages and keep only the ones we want.</p>

<p>So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see. </p>
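<p>The merging step can be sketched as below. It assumes that "within 10 pages" means the page numbers of two consecutive wanted pages differ by at most 10; the function name is hypothetical.</p>

```python
MAX_GAP = 10  # merge two reads when their page numbers are at most this far apart

def merge_reads(pages, max_gap=MAX_GAP):
    """Turn a collection of wanted page numbers into (first, last) read requests."""
    runs = []
    for page in sorted(set(pages)):
        if runs and max_gap >= page - runs[-1][1]:
            runs[-1][1] = page         # close enough: extend the current run
        else:
            runs.append([page, page])  # too far away: start a new run
    return [tuple(run) for run in runs]

requests = merge_reads([3, 4, 9, 40, 41, 55, 120])
print(requests)
```

<p>Each resulting pair is one read request; the pages inside a run that nobody asked for are the wasted bandwidth.</p>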

<p>So we try, and we find that read-ahead does not account for most pages since it does not get triggered.  Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.</p>
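<p>The looser triggering condition can be sketched as follows. The bookkeeping (a dictionary of first-read times per extent) and all names are assumptions made for illustration; only the rule itself, a second read in the same 2MB extent within 20 seconds of the first, comes from the text.</p>

```python
PAGES_PER_EXTENT = 256

class ExtentReadAhead:
    """Schedules a whole extent when it gets a second read within the window."""

    def __init__(self, window_seconds=20.0):
        self.window = window_seconds
        self.first_read = {}  # extent number -> time of the first read in it

    def on_read(self, page_no, now):
        """Return the range of pages to pre-read, or None if not triggered."""
        extent = page_no // PAGES_PER_EXTENT
        earlier = self.first_read.get(extent)
        if earlier is not None and self.window >= now - earlier:
            del self.first_read[extent]
            first = extent * PAGES_PER_EXTENT
            return range(first, first + PAGES_PER_EXTENT)
        # First read, or the previous one was too long ago: restart the clock
        self.first_read[extent] = now
        return None

ra = ExtentReadAhead()
miss = ra.on_read(300, now=0.0)    # first read in extent 1
hit = ra.on_read(310, now=12.5)    # second read in extent 1, 12.5 s later
print(miss, hit)
```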

<p>The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gives a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.</p>
<p>Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more. </p>
<h3>BSBM Note </h3>

<p>We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score well on it is to use speculative read to the maximum.</p>

<p>Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.</p>

<p>A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x29987c90">RDF</a> stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.</p>

<p>Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li> <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1ecb2af0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x19d05678">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1d542328">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x13947e08">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1a7f6b30">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1d67dd40">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1ebcee68">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1a855ba0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1b081e70">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1d7a7940">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d7e2cd0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e375338">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d199728">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e808818">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-04#1665">
  <rss:title>Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-04T20:28:28Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning. Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e.g., web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention. The following three settings are all in the [Parameters] section of the virtuoso.ini file. AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better. ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better. IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial. A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact. In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. 
Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available. Note: The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count. File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question. It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples. In the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files. There are no SSD specific settings. Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes? Use one stream per core (not per core thread). 
In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed. Use the built-in bulk load facility, i.e., ld_dir (&#39;&lt;source-filename-or-directory&gt;&#39;, &#39;&lt;file name pattern&gt;&#39;, &#39;&lt;destination graph iri&gt;&#39;); For example, SQL&gt; ld_dir (&#39;/path/to/files&#39;, &#39;*.n3&#39;, &#39;http://dbpedia.org&#39;); Then do a rdf_loader_run () on enough connections. For example, you can use the shell command isql rdf_loader_run () &amp; to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details). See the BSBM disclosure report for an example load script. What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint. Execute CHECKPOINT; through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol. What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load. No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. 
Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with DefaultIsolation = 4 Transaction isolation cannot be set over the SPARQL protocol. NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update. What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured. In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If &quot;swappiness&quot; on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting. What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache? In an isql session, execute STATUS ( ? ? ); The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format. What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. 
If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index. Execute on an isql session: CHECKPOINT; SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC; The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for the text index. If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph. The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute STATISTICS DB.DBA.RDF_QUAD; For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index? The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order. For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect? In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be increased from the default of 10000 to 50000 or so if this is seen to be useful. 
To change this on the fly, the following can be issued through an isql session: cl_exec ( &#39; __dbf_set (&#39;&#39;cl_request_batch_size&#39;&#39;, 50000) &#39; ); The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields. STATUS (&#39;cluster&#39;) ;; whole cluster STATUS (&#39;cluster_d&#39;) ;; process-by-process Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings? For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with StopCompilerWhenXOverRun = 1 When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with DefaultIsolation = 2 Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire (this post) Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? 
Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Below is a questionnaire I sent to the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa2f6798">BSBM</a> participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x195c2070">Virtuoso</a>, here. This can be a checklist for pretty much any <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e1c9bb0">RDF</a> database tuning.</p>


<ol>
<li>
<p>
    <b>Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e.g., web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.</b>
  </p>

<p>The following three settings are all <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1ed4fe10">in the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>. </p>

<ul>
<li>
      <p>
     <b><code>AsyncQueueMaxThreads</code>
     </b> controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either <b>1.5 * the number of cores</b> or <b>1.5 * the number of core threads</b>; see which works better.</p>
    </li>

<li>
      <p>
     <b><code>ThreadsPerQuery</code>
     </b> is the maximum number of threads a single query will take. This should be set to either <b>the number of cores</b> or <b>the number of core threads</b>; see which works better. </p>
    </li>

<li>
      <p>
     <b><code>IndexTreeMaps</code>
     </b> is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (<b>256</b> in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to <b>64, 128, or 512</b> may be beneficial.</p>

<p>A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1262b4b0">cache</a> artifact.</p>
    </li>
</ul>

  <p>
    <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_HTTPServer" id="link-id0x1f8960a0">In the <code>[HTTPServer]</code> section of the <code>virtuoso.ini</code> file</a>, the <b><code>ServerThreads</code></b> setting is the number of web server threads, i.e., the maximum number of concurrent <a class="auto-href" href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x17c1bef0">SPARQL protocol</a> requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.</p>
<p>Note: The <code>[HTTPServer] ServerThreads</code> are taken from the total pool made available by the <code>[Parameters] ServerThreads</code>. Thus, the <code>[Parameters] ServerThreads</code> should always be at least as large as (and is best set greater than) the <code>[HTTPServer] ServerThreads</code>, and if using the closed-source Commercial Version, <code>[Parameters] ServerThreads</code> cannot exceed the licensed thread count. </p>
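<p>As a rough illustration of the arithmetic above, the following Python sketch derives starting values for <code>AsyncQueueMaxThreads</code> and <code>ThreadsPerQuery</code> from a core count. The multipliers come from the text; the function names and the printed layout are assumptions made for this example, not part of Virtuoso.</p>

```python
# Sketch: derive starting values for the thread settings discussed above.
# The 1.5x and 1x multipliers come from the text; the function names and
# the printed layout are illustrative assumptions, not Virtuoso tooling.

def threading_settings(cores, threads_per_core=1):
    """Return candidate [Parameters] values for one machine.

    Pass threads_per_core=2 to try the "core threads" variant on a
    hyperthreaded CPU; the text suggests measuring both variants.
    """
    units = cores * threads_per_core
    return {
        "AsyncQueueMaxThreads": int(units * 1.5),
        "ThreadsPerQuery": units,
    }


def as_ini(settings):
    """Format the values as lines for the [Parameters] section."""
    lines = ["[Parameters]"]
    for name, value in settings.items():
        lines.append(f"{name} = {value}")
    return "\n".join(lines)


# The two target machines named in the question: 8 and 32 cores.
for cores in (8, 32):
    print(as_ini(threading_settings(cores)))
```

<p>Both variants (cores and core threads) are only starting points; as noted above, the better one has to be found by measurement.</p>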
</li>


<li>
<p>
    <b>File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.</b>
  </p>

<p>It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the <code>segment</code> declaration in the <code>virtuoso.ini</code> file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1e0deb38">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/C%2B%2B" id="link-id0x1ddc2bf0">C</a> sample for examples. </p>

<p> <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1f893f48">In the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, set <code>FDsPerFile</code> to be <code> (the number of concurrent threads * 1.5) ÷ the number of distinct database files</code>.</p>

<p>There are no SSD specific settings.</p>
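<p>The <code>FDsPerFile</code> rule can be worked through as follows. This is a minimal sketch: the formula is the one given above, while rounding up to a whole number and the lower bound of 1 are assumptions of this example.</p>

```python
import math

# Sketch of the FDsPerFile rule from the text:
#   (number of concurrent threads * 1.5) / number of distinct database files
# Rounding up and the floor of 1 are assumptions made for this example.

def fds_per_file(concurrent_threads, database_files):
    """Return a candidate FDsPerFile value as a whole number."""
    return max(1, math.ceil(concurrent_threads * 1.5 / database_files))


# 16 concurrent threads over 4 striped files: 24 / 4 = 6
print("FDsPerFile =", fds_per_file(16, 4))
```

<p>With 8 threads and 5 files the raw result is 2.4, which this sketch rounds up to 3.</p>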
</li>


<li>
<p>
    <b>Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes? </b>
  </p>

<p>Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.</p>

<p>Use the built-in bulk load facility, i.e., </p>
<blockquote>
    <code>ld_dir (&#39;&lt;source-filename-or-directory&gt;&#39;, &#39;&lt;file name pattern&gt;&#39;, &#39;&lt;destination graph iri&gt;&#39;);</code>
  </blockquote>
<p>For example,</p>
<blockquote>
    <code><a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x17c854c0">SQL</a>&gt; ld_dir (&#39;/path/to/files&#39;, &#39;*.n3&#39;, &#39;<a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1f10c3d8">http</a>://<a class="auto-href" href="http://dbpedia.org/resource/DBpedia" id="link-id0x1c6378a0">dbpedia</a>.org&#39;);</code>
  </blockquote>
<p>Then do a <code>rdf_loader_run ()</code> on enough connections. For example, you can use the shell command </p>
<blockquote>
    <code>isql rdf_loader_run () &amp;</code> </blockquote>
<p>to start one in a background isql process. When starting background load commands from the shell, you can use the shell <code>wait</code> command to wait for completion. If starting from isql, use the <code>wait_for_children;</code> command (see <a href="http://docs.openlinksw.com/virtuoso/isql.html" id="link-id0x1ae0f230">isql documentation</a> for details). </p>
<p>See the <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d635820">BSBM disclosure report</a> for an example load script.</p>
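<p>The steps above can be sketched as a small Python driver that starts one <code>rdf_loader_run ()</code> per stream and waits for all of them. This is a sketch under assumptions: the port <code>1111</code>, the <code>dba</code> user and password, and the binary name <code>isql</code> are placeholders for a local installation (some distributions install it under a different name), and the helper names are hypothetical.</p>

```python
import subprocess

# Sketch of a parallel bulk load driver. Port, user, password and the
# binary name "isql" are assumptions; adjust them for your installation.

def isql_command(statement, port=1111, user="dba", password="dba"):
    """Build an argv list that runs one SQL statement through isql."""
    return ["isql", str(port), user, password, f"exec={statement}"]


def run_bulk_load(streams):
    """Start one rdf_loader_run () per stream and wait for all of them.

    The files must already have been registered with ld_dir (...).
    """
    loaders = [
        subprocess.Popen(isql_command("rdf_loader_run ();"))
        for _ in range(streams)
    ]
    for proc in loaders:
        proc.wait()  # plays the role of the shell wait command


# Example (not executed here): one stream per core on an 8-core machine.
# run_bulk_load(8)
```

<p>The load and the following <code>CHECKPOINT;</code> are kept separate here because the questionnaire times them separately.</p>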
</li>


<li>
<p>
    <b>What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c522378">CPU</a>-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.</b>
  </p>

<p>Execute </p>
<blockquote>
    <code> CHECKPOINT;</code>
  </blockquote> 
<p>through a SQL client, e.g., <code>isql</code>. This is not a <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1c1e95b0">SPARQL</a> statement and cannot be executed over the SPARQL protocol.</p>
</li>


<li>
<p>
    <b>What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.</b>
  </p>

<p>No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is <b><code>REPEATABLE READ</code></b>, but this may be altered via SQL session settings or at Virtuoso server start-up through <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1a791b80">the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, with</p>
<blockquote>
   <b><code><a href="http://wikis.openlinksw.com/dataspace/owiki/wiki/VirtuosoWikiWeb/ChangeVirtuosoSDefaultTransactionIsolationLevel" id="link-id0x1e5536b8">DefaultIsolation</a> = 4</code>
   </b>
  </blockquote>
<p> Transaction isolation cannot be set over the SPARQL protocol.</p>
<p> NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x1c592f70">ACID</a> considerations.  See answer #12, below, and detailed discussion in part 8 of this series, <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b7eb5f0">BSBM <i>Explore and Update</i></a>.</p>
</li>


<li>
<p>
    <b>What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured. </b>
  </p>

<p>
    <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1acd8fe8">In the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, <b><code>NumberOfBuffers</code></b> controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If &quot;swappiness&quot; on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.</p>
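<p>The sizing rule can be turned into a quick calculation. The sketch below assumes that "10KB" means 10 KiB and that exactly two-thirds of physical memory is given to buffers; the function name is made up for this example.</p>

```python
# Sketch: estimate NumberOfBuffers from physical memory.
# Assumptions of this example: 10 KiB of RAM per 8KB page (the
# "in practice" figure from the text) and a two-thirds share of memory.

def number_of_buffers(ram_gib, kib_per_buffer=10):
    """Return a candidate NumberOfBuffers for a machine with ram_gib GiB."""
    ram_kib = ram_gib * 1024 * 1024
    # Integer arithmetic: two-thirds of RAM divided by the per-buffer cost.
    return ram_kib * 2 // (3 * kib_per_buffer)


# A 72 GiB machine gives roughly five million buffers.
print("NumberOfBuffers =", number_of_buffers(72))
```

<p>If swapping is observed with the resulting value, lower it, as described above.</p>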
</li>


<li>
<p>
    <b>What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache? </b>
  </p>

<p>In an <code>isql</code> session, execute </p>
<blockquote>
    <code>STATUS (&#39;&#39;);</code>
  </blockquote> 
<p>The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.</p>
</li>


<li>
<p>
    <b>What command gives <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x11bf3008">information</a> on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.</b>
  </p>


<p>Execute on an <code>isql</code> session: </p>

<blockquote>
   <pre><code>CHECKPOINT;
SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
</code></pre>
  </blockquote>

<p>The <code>iss_pages</code> column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to <code>RDF_QUAD</code> are for quads; <code>RDF_IRI</code>, <code>RDF_PREFIX</code>, <code>RO_START</code>, <code>RDF_OBJ</code> are for dictionaries; <code>RDF_OBJ_RO_FLAGS_WORDS</code> and <code>VTLOG_DB_DBA_RDF_OBJ</code> are for the text index. </p>


</li>
<li>
<p>
    <b>If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x17eb98f8">data</a> will be in a single big graph.</b>
  </p>

<p>The default scheme uses quads. The default index layout is <code>PSOG</code>, <code>POGS</code>, <code>GS</code>, <code>SP</code>, <code>OP</code>. To see the current index scheme, use an <code>isql</code> session to execute</p>
<blockquote>
    <code>STATISTICS DB.DBA.RDF_QUAD;</code>
  </blockquote>


</li>
<li>
<p>
    <b>For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by <code>S</code> or <code>O</code> depending on which is first in key order for each index? </b>
  </p>

<p>The default partitioning settings are good, i.e., partitioning is on <code>O</code> or <code>S</code>, whichever is first in key order.</p>


</li>
<li>
<p>
    <b>For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?</b>
  </p>

<p> <a href="http://docs.openlinksw.com/virtuoso/clusteroperation.html#clusteroperationgeneralclusterinifields" id="link-id0x1ec6dff0">In the <code>[Cluster]</code> section of the <code>cluster.ini</code> file</a>, <b><code>ReqBatchSize</code></b> is the number of query states dispatched between cluster nodes per message round trip. This may be increased from the default of <code>10000</code> to <code>50000</code> or so if this is seen to be useful. </p>

<p>To change this on the fly, the following can be issued through an <code>isql</code> session:</p>
<blockquote>
<code>cl_exec ( &#39; __dbf_set (&#39;&#39;cl_request_batch_size&#39;&#39;, 50000) &#39; ); </code>
  </blockquote>

<p>The commands below may be executed through an <code>isql</code> session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation <a href="http://docs.openlinksw.com/virtuoso/clusteroperation.html#clusteroperationadminstdispl" id="link-id0x1dfccec0">details the fields</a>. </p>
<blockquote>
   <pre> <code>STATUS (&#39;cluster&#39;);      -- whole cluster</code> <br /> <code>STATUS (&#39;cluster_d&#39;);    -- process-by-process</code>
   </pre></blockquote>

</li>
<li>
<p>
    <b>Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM <i>Explore</i> mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?</b>
  </p>

<ul>
<li>
      <p>For BSBM, needless query <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x11be47b8">optimization</a> should be capped at Virtuoso server start-up through the <code>[Parameters]</code> section of the <code>virtuoso.ini</code>, with</p>
<blockquote>
     <b><code>StopCompilerWhenXOverRun = 1</code>
     </b>
      </blockquote> </li>
<li>
      <p>When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of <code>READ COMMITTED</code>, to remove most lock contention.  Transaction isolation cannot be adjusted via SPARQL.  This can be changed through SQL session settings, or at Virtuoso server start-up <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1f3a43c8">through the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, with</p>
<blockquote>
     <b><code><a href="http://wikis.openlinksw.com/dataspace/owiki/wiki/VirtuosoWikiWeb/ChangeVirtuosoSDefaultTransactionIsolationLevel" id="link-id0x1a5a51e0">DefaultIsolation</a> = 2</code>
     </b>
      </blockquote>
</li>
</ul>
</li>
</ol>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d6e5428">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1c3ea770">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1efeca30">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1bda5158">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1ec74808">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1ea253a0">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b02d528">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1ae81fc0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x197515c0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1a78db90">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d32ae10">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e8fcc18">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ae95050">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1dbf3158">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-03-02#1663">
  <rss:title>Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-02T23:23:16Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post I will summarize the figures for BSBM Load and Explore mixes at 100 Mt, 200 Mt, and 1000 Mt. (1 Mt = 1 Megatriple, or one million triples.) The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs. The exact specifications and configurations are in the raw reports to follow. The load time in the recent Berlin report was measured with the wrong function, and so far as we can tell, without multiple threads. The intermediate cut of Virtuoso they tested also had broken SPARQL/Update (also known as SPARUL) features. We have fixed this since, and give here the right numbers. In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso: 6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference. 6 Cluster is the generally available commercial only cluster-capable Virtuoso. 7 Single is the next generation single server Virtuoso, about to be released as a preview. To understand the numbers, we must explain how these differ from each other in execution: 6 Single has one thread-per-query, and operates on one state of the query at a time. 6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together. 7 Single has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that index lookups get large numbers of parameters which then are sorted to get an ascending search pattern which benefits from locality, so the n * log(n) index access for the batch becomes more like linear if the data accessed has any locality. 
Furthermore, if there are many operands to an operator, these can be split on multiple threads. Also, scans of consecutive rows can be split before the scan on multiple threads, each doing a range of the scan. These features are called vectored execution and query parallelization. These techniques will also be applied to the cluster variant in due time. The version 6 and 7 variants discussed here use the same physical storage layout with row-wise key compression. Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space. This column store option is not used here because it still has some problems with random order inserts. We will first consider loading. Below are the load times and rates for 7 at each scale. 7 Single Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 261,366 301 82 200 Mt 216,000 802 123 1000 Mt 130,378 6641 1012 In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale. We also loaded the smallest data set with 6 Single using the same load script. 6 Single Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 74,713 1192 145 CPU time with 6 Single was 8047 seconds. We compare this to 4453 seconds of CPU for the same load on 7 Single. The CPU% during the run was on either side of 700% for 6 Single and 1300% for 7 Single. Note that high percentages involve core threads, not real cores. The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. The 6 Single inserts transactionally but makes very frequent commits and writes no log, resulting in de facto non-transactional behavior but still there is a lock and commit cycle. Inserts in RDF load usually exhibit locality on all SPOG. 
Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go. Contention on page read-write locks is less because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row. Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed. In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful. Writes are all in-place, and no delta-merge mechanism is involved. For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block. Repeatable and serializable readers would block before an uncommitted insert. Now for the run (larger numbers indicate more queries executed, and are therefore better): 6 Single Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487 7 Single Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 11742 72278 200 Mt 10225 60951 1000 Mt 6262 24672 The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run. Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. 
The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%. This also explains why adding clients gives a larger boost at the smaller scale. Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of Single User Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH which is a bit worse than 6 Single. This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps. The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt. In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later. The full run details will be supplied at the end of this blog series. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore (this post) Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In this post I will summarize the figures for <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1edb1dd0">BSBM</a> Load and <i>Explore</i> mixes at 100 Mt, 200 Mt, and 1000 Mt.  (1 Mt = 1 Megatriple, or one million triples.)  The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs.  The exact specifications and configurations are in the raw reports to follow.</p>

<p>The load time in <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1f3716d8">the recent Berlin report</a> was measured with <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html#resultsExplore" id="link-id0x1dd37f80">the wrong function</a>, and so far as we can tell, without multiple threads. The intermediate cut of <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1c1c7798">Virtuoso</a> they tested also <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html#resultsExploreAndUpdate" id="link-id0x1e5fcf40"> had broken</a> <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1bfa40b8">SPARQL</a>/<a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1c1e1320">Update</a> (also known as <a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1ddc87d8">SPARUL</a>) features.  We have fixed this since, and give <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/results.zip" id="link-id0x1edf36b0">here the right numbers</a>.</p>

<p>In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> is the generally available single-server configuration of Virtuoso.  Whether this is the open source edition or not makes no difference here.</p>
 </li>
<li>
  <p>
    <i>6 Cluster</i> is the generally available, commercial-only, cluster-capable Virtuoso.</p>
</li>
<li>
  <p>
    <i>7 Single</i> is the next-generation single-server Virtuoso, about to be released as a preview.</p>
</li>
</ul>

<p>To understand the numbers, we must explain how these differ from each other in execution:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> has one thread-per-query, and operates on one state of the query at a time.</p>
 </li>

<li>
  <p>
    <i>6 Cluster</i> has one thread-per-query-per-process, and between processes it operates on batches of some tens of thousands of simultaneous query states.  Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.</p>
</li>

<li>
  <p>
    <i>7 Single</i> has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states.  This means, for example, that an index lookup receives a large number of parameters at once.  These are sorted to give an ascending search pattern, so that if the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1ea197c8">data</a> accessed has any locality, the <code>n * log(n)</code> cost of index access for the batch becomes closer to linear.  Furthermore, if an operator has many operands, these can be split across multiple threads.  Also, a scan of consecutive rows can be divided among multiple threads before it starts, each thread doing one range of the scan.  These features are called <i>vectored execution</i> and <i>query parallelization</i>.  These techniques will also be applied to the cluster variant in due time.</p>
</li>
</ul>
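
<p>To make the locality argument concrete, here is a minimal sketch in Python. It is not Virtuoso code: a sorted list stands in for the index, and <code>lookup_batch</code> and <code>lookup_one_at_a_time</code> are hypothetical names. The batch version sorts the search keys and resumes each search from where the previous one ended, instead of restarting from the top for every key.</p>

```python
from bisect import bisect_left


def lookup_one_at_a_time(index, keys):
    """One full binary search per key, in arrival order."""
    return [bisect_left(index, k) for k in keys]


def lookup_batch(index, keys):
    """Sort the keys, then search each one only to the right of the previous hit."""
    # Visit the keys in ascending order but remember their original slots.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    positions = [0] * len(keys)
    lo = 0
    for i in order:
        # Keys are ascending, so the answer can never lie left of the last one.
        lo = bisect_left(index, keys[i], lo)
        positions[i] = lo
    return positions


# Stand-in index: the even numbers below 1000, already sorted.
index = list(range(0, 1000, 2))
keys = [512, 8, 510, 10, 514]

# Both strategies find the same positions; only the access pattern differs.
assert lookup_batch(index, keys) == lookup_one_at_a_time(index, keys)
```

<p>In a real B-tree the gain would come from staying on the same leaf page for consecutive keys; the shrinking search range here is only a stand-in for that effect.</p>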

<p>The version 6 and 7 variants discussed here use the same physical storage layout with row-wise <a class="auto-href" href="http://dbpedia.org/resource/Data_compression" id="link-id0x1bd035c0">key compression</a>.  Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space.  This column store option is not used here because it still has some problems with random order inserts.</p>

<p>We will first consider loading.  Below are the load times and rates for 7 Single at each scale.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">7 Single</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 261,366 </td>
		<td align="center"> 301 </td>
		<td align="center"> 82 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 216,000 </td>
		<td align="center"> 802 </td>
		<td align="center"> 123 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 130,378 </td>
		<td align="center"> 6641 </td>
		<td align="center"> 1012 </td>
	</tr>
</table>

<p>In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the largest scale.</p>

<p>We also loaded the smallest data set with 6 Single using the same load script.</p>
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">6 Single</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 74,713 </td>
		<td align="center"> 1192 </td>
		<td align="center"> 145 </td>
	</tr>
</table>


<p>
<a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c0b96c0">CPU</a> time with 6 Single was 8047 seconds.  We compare this to 4453 seconds of CPU for the same load on 7 Single.  The CPU% during the run hovered around 700% for 6 Single and 1300% for 7 Single.  Note that percentages above 800% count hardware threads (hyperthreads), not physical cores. </p>
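
<p>As a cross-check on the tables above, the short Python sketch below recomputes the rates. It assumes that each reported rate is the nominal triple count divided by load time plus checkpoint time; this is an inference from the figures, not something stated in the reports, and <code>implied_rate</code> is a hypothetical helper.</p>

```python
# name: (nominal millions of triples, load s, checkpoint s, reported quads/s)
runs = {
    "7 Single, 100 Mt": (100, 301, 82, 261366),
    "7 Single, 200 Mt": (200, 802, 123, 216000),
    "7 Single, 1000 Mt": (1000, 6641, 1012, 130378),
    "6 Single, 100 Mt": (100, 1192, 145, 74713),
}


def implied_rate(mt, load_s, checkpoint_s):
    """Nominal quads divided by load plus checkpoint time (assumed definition)."""
    return mt * 1_000_000 / (load_s + checkpoint_s)


def relative_gap(name):
    """Relative difference between the implied and the reported rate."""
    mt, load_s, checkpoint_s, reported = runs[name]
    return abs(implied_rate(mt, load_s, checkpoint_s) - reported) / reported


for name in runs:
    print(f"{name}: gap {relative_gap(name):.2%}")

elapsed_ratio = 1192 / 301  # 6 Single vs 7 Single load time at 100 Mt
cpu_ratio = 8047 / 4453     # CPU seconds for the same load
print(f"elapsed {elapsed_ratio:.2f}x, CPU {cpu_ratio:.2f}x")
```

<p>Under that assumption every implied rate lands within one percent of the reported one. Elapsed load time drops by roughly 4x from 6 Single to 7 Single while CPU time drops by roughly 1.8x, which fits the higher CPU% seen on 7 Single.</p>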

<p>The difference is mostly attributable to vectoring and to the introduction of a non-transactional insert.  6 Single inserts transactionally but commits very frequently and writes no log; the result is <i>de facto</i> non-transactional behavior, but each insert still goes through a lock and commit cycle.  Inserts in an <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1ddef3e8">RDF</a> load usually exhibit locality on all of S, P, O, and G.  Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go.  Contention on page read-write locks is lower because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possibly transaction locks for each row.</p>

<p>Furthermore, for single-stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed.  In a multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful.  Writes are all in-place, and no delta-merge mechanism is involved.  For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block.  Repeatable and serializable readers would block before an uncommitted insert.</p>



<p>Now for the run (larger numbers indicate more queries executed, and are therefore better):</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 6 Single Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7641 </td>
		<td align="center"> 29433 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 6017 </td>
		<td align="center"> 13335 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 1770 </td>
		<td align="center"> 2487 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 7 Single Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 11742 </td>
		<td align="center"> 72278 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 10225 </td>
		<td align="center"> 60951 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 6262 </td>
		<td align="center"> 24672 </td>
	</tr>
</table>

<p>The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state.  Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed.  For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.</p>

<p>Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%.  This also explains why adding clients gives a larger boost at the smaller scale. </p>
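
<p>The ratios behind these observations can be recomputed from the two throughput tables. The Python sketch below only restates the published QMpH figures; the variable names are ours.</p>

```python
scales = ["100 Mt", "200 Mt", "1000 Mt"]

# QMpH figures copied from the two throughput tables.
v6 = {"single": [7641, 6017, 1770], "16 user": [29433, 13335, 2487]}
v7 = {"single": [11742, 10225, 6262], "16 user": [72278, 60951, 24672]}


def ratios(numerators, denominators):
    """Element-wise ratio, rounded to two decimals."""
    return [round(n / d, 2) for n, d in zip(numerators, denominators)]


# How much faster 7 Single is than 6 Single at each scale.
speedup_single = ratios(v7["single"], v6["single"])
speedup_multi = ratios(v7["16 user"], v6["16 user"])

# How much each engine gains from going from 1 to 16 clients.
scaling_v6 = ratios(v6["16 user"], v6["single"])
scaling_v7 = ratios(v7["16 user"], v7["single"])

for i, scale in enumerate(scales):
    print(scale, speedup_single[i], speedup_multi[i], scaling_v6[i], scaling_v7[i])
```

<p>The single-user advantage of 7 Single grows from roughly 1.5x at 100 Mt to roughly 3.5x at 1000 Mt, while its gain from adding clients shrinks from roughly 6x to roughly 4x over the same range.</p>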

<p>Now let us look at the relative effects of parallelizing and vectoring in 7 Single.  We run 50 mixes of Single User <i>Explore</i>: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread.  Then we set the vector size to 1, meaning that the query pipeline runs one row at a time.  This gets us 1319 QMpH, which is a bit worse than 6 Single.  This is to be expected, since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.</p>
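
<p>The two effects can be separated with simple division. The Python sketch below uses only the figures quoted above and assumes that the vector-size-1 run was still limited to a single thread, which the text implies but does not state outright.</p>

```python
# QMpH for 7 Single, single user, 50 mixes (figures quoted in the text).
both_on = 6132        # parallelizing and vectoring
single_thread = 2805  # vectoring only
vector_size_1 = 1319  # assumed: one thread, one row at a time

gain_parallel = both_on / single_thread          # about 2.19
gain_vectoring = single_thread / vector_size_1   # about 2.13
gain_total = both_on / vector_size_1             # about 4.65

# Q5 in queries per second, single thread.
q5_vectored = 1.9
q5_vector_size_1 = 0.8
q5_v6 = 1.13
q5_gain_vectoring = q5_vectored / q5_vector_size_1  # about 2.4
q5_gain_over_v6 = q5_vectored / q5_v6               # about 1.7

print(f"parallel {gain_parallel:.2f}x, vectoring {gain_vectoring:.2f}x")
print(f"Q5: vectoring {q5_gain_vectoring:.2f}x, vs 6 Single {q5_gain_over_v6:.2f}x")
```

<p>On this mix the two techniques contribute about equally, a little over 2x each.</p>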

<p>The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least.  The reason for the latter is covered in detail in <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1b9ed390">A Benchmarking Story</a>.  We note that while vectoring is primarily geared to better single-thread speed and better <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1ddc2f78">cache</a> hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.</p>

<p>In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring.  When moving to more complex workloads, the benefits become more pronounced.  For a single-user complex query load, we can get a 7x speed-up from parallelism (on 8 cores), plus up to 3x from vectoring.  These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.</p>

<p>The full run details will be supplied at the end of this <a class="auto-href" href="http://dbpedia.org/resource/Blog" id="link-id0x1e9f69f0">blog</a> series.</p>

<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d0bb988">Benchmarks, Redux (part 1): On RDF Benchmarks </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x155fc700">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d96e218">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1d7a5170">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1def9ca0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1a7a7800">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1e9c6c68">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1e80c208">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dafd290">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1f34f7f8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1df24f50">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f4b19c8">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1de90cf8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ebefbe8">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-02-28#1660">
  <rss:title>Benchmarks, Redux (part 2): A Benchmarking Story</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-02-28T21:12:28Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Caeterum censeo, benchmarks are for vendors... This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks. We begin right after the publication of the recent Berlin report. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole BSBM matter and the general benchmarking question in forthcoming posts; for now, let&#39;s talk about specifics. In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso: 6 Single is the generally available single-instance-server configuration of Virtuoso. Whether this is open source or not does not make a difference. 6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso. 7 Single is the next-generation single-instance-server Virtuoso, about to be released as a preview. We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM Explore mix at one scale got better throughput as we added more clients, approximately as one would expect based on CPU usage and number of cores, while at another scale this was not so. At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%. When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. 
But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload. See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous. Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Quite right, when building with the mutex meter enabled, counting how many times each mutex is acquired and how many times this results in a wait, we found a mutex which gets acquired 600M times in the run, of which an insane 450M result in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex. Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better. Now, the mutex in question happens to serialize the buffer cache for one specific page of data, one level down from the root of the index for RDF PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties web their different ways based on the index root. One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. 
Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers. Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time. So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously &quot;good&quot; throughput of 40K QMpH at 100 Mt went to 66K QMpH. Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. 
Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue. We see how detailed analysis of benchmarks can lead to almost an order of magnitude improvements in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real knowledge of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too but application optimization tends to stop at a point where the application is usable. Also stored procedures and specially-tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development but then this falls under benchmarking and evaluation. So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story (this post) Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? 
Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<blockquote>
<i>Caeterum censeo, benchmarks are for vendors...</i>
</blockquote>

<p>This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks.</p>

<p>We begin right after the publication of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1df843f8">recent Berlin report</a>. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1da5e160">BSBM</a> matter and the general benchmarking question in forthcoming posts; for now, let&#39;s talk about specifics.</p>

<p>In the course of the discussion to follow, we talk about 3 different kinds of <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1c1e12f0">Virtuoso</a>:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> is the generally available single-instance-server configuration of Virtuoso.  Whether this is open source or not does not make a difference.</p>
 </li>
<li>
  <p>
    <i>6 Cluster</i> is the generally available, commercial-only, cluster-capable Virtuoso.</p>
</li>
<li>
  <p>
    <i>7 Single</i> is the next-generation single-instance-server Virtuoso, about to be released as a preview.</p>
</li>
</ul>


<p>We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM <i>Explore</i> mix at one scale got better throughput as we added more clients, approximately as one would expect based on <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1ddb5200">CPU</a> usage and number of cores, while at another scale this was not so.</p>

<p>At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU go from 200% with 1 client to 1400% with 16 clients, but throughput increased by less than 20%. </p>

<p>When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload. </p>

<p> See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous.</p>

<p>Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Sure enough, when we built with the mutex meter enabled, which counts how many times each mutex is acquired and how many of these acquisitions result in a wait, we found a mutex that was acquired 600M times in the run, of which an insane 450M resulted in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex.</p>
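
<p>The arithmetic behind that estimate is short enough to spell out. The Python sketch below uses the counts from the mutex meter and the one-microsecond-per-wait rule of thumb given above; the variable names are ours.</p>

```python
acquisitions = 600_000_000  # times the mutex was acquired during the run
waits = 450_000_000         # acquisitions that had to wait
wait_cost_us = 1            # rule of thumb: one microsecond per task switch
run_s = 500                 # approximate length of the run in seconds

wait_fraction = waits / acquisitions            # share of acquisitions that wait
wait_time_s = waits * wait_cost_us / 1_000_000  # seconds lost to waiting
share_of_run = wait_time_s / run_s              # share of the run spent waiting

print(f"{wait_fraction:.0%} of acquisitions wait")
print(f"{wait_time_s:.0f} s waiting, {share_of_run:.0%} of the run")
```

<p>Three out of four acquisitions wait, and by this estimate the waits account for about nine tenths of the run.</p>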

<p>Waiting for a mutex is a real train wreck. We have tried spinning a few times before waiting, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better.</p>
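
<p>For readers unfamiliar with the idea, spinning before waiting looks roughly like the sketch below. It uses Python's <code>threading.Lock</code> purely for illustration; <code>acquire_with_spin</code> and <code>max_spins</code> are hypothetical names, and this is not how Virtuoso or the OS implements it.</p>

```python
import threading


def acquire_with_spin(lock, max_spins=100):
    """Try the lock up to max_spins times without blocking, then block.

    Returns the number of failed non-blocking attempts.
    """
    for attempt in range(max_spins):
        if lock.acquire(blocking=False):
            return attempt  # got the lock without a blocking wait
    lock.acquire()  # give up spinning and wait like an ordinary mutex
    return max_spins


if __name__ == "__main__":
    lock = threading.Lock()
    failed_attempts = acquire_with_spin(lock)
    print("failed attempts before acquiring:", failed_attempts)
    lock.release()
```

<p>Spinning only pays off when the holder releases the lock within those few attempts; otherwise the spins are pure extra CPU on top of the wait, which matches what we saw.</p>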

<p>Now, the mutex in question happens to serialize the buffer <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1c1c29d8">cache</a> for one specific page of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1c1c2a30">data</a>, one level down from the root of the index for <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e543ce0">RDF</a> PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties wend their different ways from the index root.</p>

<p>One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers.</p>

<p>Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time.</p>
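
<p>The effect of the locality check can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Virtuoso's index code: <code>ToyIndex</code> and <code>vectored_lookup</code> are hypothetical, the index is just a sorted list cut into leaf pages, and <code>check_locality=False</code> stands for the case where the check was skipped because one condition was a range.</p>

```python
from bisect import bisect_right


class ToyIndex:
    """Sorted keys split into fixed-size leaf pages (a stand-in for a B-tree)."""

    def __init__(self, keys, page_size):
        ordered = sorted(keys)
        self.leaves = [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]
        self.first_keys = [leaf[0] for leaf in self.leaves]
        self.root_descents = 0

    def descend(self, key):
        """Find the leaf for key starting from the root; counts the descent."""
        self.root_descents += 1
        return max(bisect_right(self.first_keys, key) - 1, 0)


def vectored_lookup(index, keys, check_locality):
    """Look up a batch of keys in sorted order; return one hit flag per key."""
    hits = []
    leaf = None
    for key in sorted(keys):
        in_current_leaf = leaf is not None and leaf[0] <= key <= leaf[-1]
        if not (check_locality and in_current_leaf):
            leaf = index.leaves[index.descend(key)]  # back to the tree top
        hits.append(key in leaf)
    return hits


if __name__ == "__main__":
    batch = range(0, 1000, 10)
    for check_locality in (False, True):
        index = ToyIndex(range(1000), page_size=100)
        vectored_lookup(index, batch, check_locality)
        print(check_locality, index.root_descents)
```

<p>In this toy setup, a batch of 100 sorted keys over 10 leaf pages costs 100 descents from the root without the check and 10 with it. Every descent avoided is one less pass through the contended upper pages.</p>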

<p>So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously &quot;good&quot; throughput of 40K QMpH at 100 Mt went to 66K QMpH. </p>

<p>Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. Each partition had a page getting the hits, but since the partitioning was on S and S was roughly evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue. </p>
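
<p>To put the quoted throughputs side by side, the ratios can be worked out as follows. The numbers are the QMpH figures from the text; the labels and the <code>ratio</code> helper are only for illustration, and "over 21000" is taken as 21000.</p>

```python
# QMpH figures quoted in the text: (single user, 16 users).
RUNS = {
    "1000 Mt, before the fix": (2000, 3800),
    "1000 Mt, after the fix": (5200, 21000),  # "over 21000" in the text
    "6 Single, no vectoring": (1770, 2487),
}


def ratio(after, before):
    """Return after / before rounded to two decimals."""
    return round(after / before, 2)


if __name__ == "__main__":
    for name, (single, multi) in RUNS.items():
        print(f"{name}: 16-user / 1-user = {ratio(multi, single)}")
    before = RUNS["1000 Mt, before the fix"]
    after = RUNS["1000 Mt, after the fix"]
    print("single-user gain from the fix:", ratio(after[0], before[0]))
    print("16-user gain from the fix:", ratio(after[1], before[1]))
```

<p>The multiuser gain from the fix (about 5.5x) is roughly twice the single-user gain (2.6x), which is what one would expect when the change removes contention and not just work.</p>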

<p>We see how detailed analysis of benchmarks can lead to improvements of almost an order of magnitude in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x1c1e2578">knowledge</a> of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too, but application <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1c59f4d8">optimization</a> tends to stop at a point where the application is usable. Also, stored procedures and specially-tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development, but then this falls under benchmarking and evaluation.</p>

<p>So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios.</p>

<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1f619550">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>Benchmarks, Redux (part 2): A Benchmarking Story <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1caa7cd8">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d8b7648">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1f2a6ba8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x17b425f0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1a7f6b30">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1ee5ec98">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1b7c5af8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dad7588">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1c5520a0">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1eb19bf8">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1eb2c398">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1fb6a118">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f160580">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-02-28#1658">
  <rss:title>Benchmarks, Redux (part 1): On RDF Benchmarks</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-02-28T20:20:22Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This post introduces a series on RDF benchmarking. In these posts I will cover the following: Correct misleading information about us in the recent Berlin report: The load rate is off the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance. Discuss configuration options for Virtuoso. Tell a story about multithreading and its perils and how vectoring and scale-out can save us. Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single. Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general. Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Whether to do things a la TPC or a la TREC? We will hopefully try a bit of both, at least so I have proposed to our partners in LOD2, the EU FP7 project that also funded the recent Berlin report. Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark. Talk about BSBM specifically. What does it measure? Discuss some experiments with the BI use case of BSBM. Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure. The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMS. Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. 
On the other hand, running a display window of stuff for benchmarking, when in at least some cases licenses prohibit unauthorized publishing of benchmark results, might be seen to conflict with the spirit of the license if not its letter. We will see. For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor&#39;s permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment. In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project&#39;s own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a maximum of 200 Mtriples of data is not stretching anything. So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go for example to Giovanni&#39;s cluster at DERI and do 10 and 20 billion triples; this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one. So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though since Ontotext and Garlik provided some information. We will look into these this week and next. 
We will not publish any information without asking first. In this series of posts I will only talk about OpenLink Software. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks (this post) Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This post introduces a series on <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1ea861a8">RDF</a> benchmarking. In these posts I will cover the following:</p>

<ul>
 <li>
  <p>Correct misleading <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x17f5faa8">information</a> about us in the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1ded41d0">recent Berlin report</a>: The load rate is off the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.</p>
 </li>

 <li>
  <p>Discuss configuration options for <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1ac63820">Virtuoso</a>.</p>
</li>

 <li>
  <p>Tell a story about multithreading and its perils and how vectoring and scale-out can save us.</p>
</li>

 <li>
  <p>Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.</p>
</li>

 <li>
  <p>Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.</p>
</li>

 <li>
  <p>Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Whether to do things a la <a class="auto-href" href="http://www.tpc.org/" id="link-id0x19c0a7f0">TPC</a> or a la TREC? We will hopefully try a bit of both, at least so I have proposed to our partners in <a class="auto-href" href="http://lod2.eu/" id="link-id0xa2e6170">LOD2</a>, the EU FP7 project that also funded the recent Berlin report.</p>
</li>

 <li>
  <p>Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark.</p>
</li>

 <li>
  <p>Talk about <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x17c81c20">BSBM</a> specifically. What does it measure?</p>
</li>

 <li>
  <p>Discuss some experiments with the BI use case of BSBM.</p>
</li>

 <li>
  <p>Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure.</p>
</li>
</ul>

<p>The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x17f2a230">RDBMS</a>.</p>

<p>Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of stuff for benchmarking, when in at least some cases licenses prohibit unauthorized publishing of benchmark results, might be seen to conflict with the spirit of the license if not its letter. We will see.</p>

<p>For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor&#39;s permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.</p>

<p>In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project&#39;s own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a maximum of 200 Mtriples of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1d7499d8">data</a> is not stretching anything.</p>

<p>So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go for example to Giovanni&#39;s cluster at <a class="auto-href" href="http://dbpedia.org/resource/Digital_Enterprise_Research_Institute" id="link-id0x17c46a88">DERI</a> and do 10 and 20 billion triples; this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS who by now should have their own lab. Even Amazon <a class="auto-href" href="http://aws.amazon.com/ec2/" id="link-id0x8bed290">EC2</a> might be an option, although not the preferred one.</p>

<p>So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though since <a class="auto-href" href="http://dbpedia.org/resource/Ontotext" id="link-id0x1d269998">Ontotext</a> and <a class="auto-href" href="http://freebase.com/guid/9202a8c04000641f8000000005c908d6" id="link-id0x1bbc0a48">Garlik</a> provided some information. We will look into these this week and next. We will not publish any information without asking first.</p>

<p>In this series of posts I will only talk about <a class="auto-href" href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id0x1ea6d948">OpenLink Software</a>.</p>

<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>Benchmarks, Redux (part 1): On RDF Benchmarks <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1b668d10">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1b3a0c08">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1f9f1740">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ad929f8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1db437c0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x17138c38">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1c0e74f8">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1f297d10">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1e4994b8">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1ebea6d0">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d5c86c0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1efec0e0">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1a9941f8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ea26de8">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2011-01-19#1649">
  <rss:title>Virtuoso Directions for 2011</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-01-19T16:29:37Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">At the start of 2010, I wrote that 2010 would be the year when RDF became performance- and cost-competitive with relational technology for data warehousing and analytics. More specifically, RDF would shine where data was heterogenous and/or where there was a high frequency of schema change. I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011. At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out. So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso&#39;s relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented HASH JOIN and GROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do. At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. 
As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns. A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time-typing turns out to be small, since data in practice tends to be of homogenous type and with locality of reference in values. Virtuoso&#39;s column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogenous types. The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side. The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results. 
LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the &quot;RDF tax,&quot; by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead. So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used. For now, our priority is to release the substantial gains that have already been accomplished. After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. 
In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper. Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all. The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing. Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed. The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. 
The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able? Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market. These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start. Stay tuned for more news later this month! Related Linked Data and Virtuoso in 2010 Linked Data &amp; The Year 2009 Retrospective and Outlook for 2008</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.openlinksw.com/weblog/oerling/?id=1603" id="link-id0x1d584720">At the start of 2010, I wrote</a> that 2010 would be the year when <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x29d4eeb8">RDF</a> became performance- and cost-competitive with relational technology for <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1d92cc18">data</a> warehousing and analytics. More specifically, RDF would shine where data was heterogenous and/or where there was a high frequency of <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x1ccf1d80">schema</a> change.</p>

<p>I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.</p>

<p>At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, <i>column-wise compression</i> means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. <i>Vectored execution</i> means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.</p>
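
<p>As a small illustration of why a sorted, homogeneous column compresses well, here is a run-length encoding sketch. It is a generic textbook technique chosen for illustration, not a description of the compression Virtuoso actually uses; <code>rle_encode</code> and <code>rle_decode</code> are hypothetical names.</p>

```python
def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


if __name__ == "__main__":
    # A sorted predicate column: consecutive rows share the same value.
    column = ["p1"] * 4 + ["p2"] * 3 + ["p3"] * 5
    runs = rle_encode(column)
    print(len(column), "values stored as", len(runs), "runs:", runs)
    assert rle_decode(runs) == column
```

<p>Stored row-wise, the repeated values would be interleaved with other columns and the runs would be lost; stored column-wise in sort order, twelve values shrink to three pairs.</p>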

<p>So, during 2010, we integrated these technologies into <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1c294c00">Virtuoso</a>, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso&#39;s relational speed is not up there with the best of analytics-oriented <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1d7bc8d0">RDBMS</a>. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented <code><a class="auto-href" href="http://dbpedia.org/resource/Hash_join" id="link-id0x29d25c10">HASH JOIN</a></code> and <code>GROUP BY</code>, and fine-tuned many of the tricks required by <a class="auto-href" href="http://www.tpc.org/" id="link-id0x8ce58b8">TPC</a>-H. <a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x1d610298">TPC-H</a> is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.</p>

<p>At the Semdata workshop of <a class="auto-href" href="http://www.vldb2010.org/" id="link-id0x1950c050">VLDB 2010</a> <a href="http://www.openlinksw.com/weblog/oerling/?id=1632" id="link-id0x1de8fee8">we presented some results</a> of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.</p>

<p>A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1d75a9c0">CPU</a> <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1ce80608">cache</a> and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time typing turns out to be small, since data in practice tends to be of homogeneous type and with locality of reference in values. Virtuoso&#39;s column store implementation resembles in broad outline other column stores like <a class="auto-href" href="http://www.vertica.com/" id="link-id0x1b303538">Vertica</a> or <a class="auto-href" href="http://www.ingres.com/vectorwise/" id="link-id0x279f6968">VectorWise</a>, the main difference being the built-in support for run-time heterogeneous types.</p>

<p>The <a class="auto-href" href="http://lod2.eu/" id="link-id0x29d48f00">LOD2</a> EU FP 7 project <a href="http://www.openlinksw.com/weblog/oerling/?id=1630" id="link-id0x1d8eaf28">started in September 2010</a>. In this project OpenLink and the celebrated heroes of the column store, <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x19ec66f0">CWI</a> of <a class="auto-href" href="http://dbpedia.org/resource/MonetDB" id="link-id0x1cda7178">MonetDB</a> and VectorWise fame, represent the database side.</p>

<p>The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1e144608">Berlin SPARQL Benchmark</a> (<a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x139cd920">BSBM</a>) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.</p>

<p>LOD2 will continue by <i>linking the universe,</i> as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the &quot;RDF tax,&quot; by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.</p>

<p>So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.</p>

<p>For now, our priority is to release the substantial gains that have already been accomplished.</p>

<p>After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x29b26b50">SPARQL</a> and seeing how it goes. In <a href="http://www.openlinksw.com/weblog/oerling/?id=1627" id="link-id0x1af60d40">the September paper</a> we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x1ce99a98">SQL</a> and SPARQL, should make a good VLDB paper.</p>
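
<p>The two triple counts quoted above imply roughly 140 million triples per GB of TPC-H scale. The short sketch below only restates that arithmetic; the assumption that the triple count grows linearly with scale is mine, and the function name is made up for the example.</p>

```python
# Figure quoted in the text: 100 GB of TPC-H is about 14 billion triples.
TRIPLES_AT_100GB = 14_000_000_000


def estimated_triples(scale_gb):
    """Linear estimate of the triple count at a given TPC-H scale in GB."""
    return TRIPLES_AT_100GB // 100 * scale_gb


for scale_gb in (100, 300):
    print(scale_gb, "GB:", estimated_triples(scale_gb) / 1e9, "billion triples")
```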

<p>Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.</p>

<p>The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.</p>

<p>Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, running TPC-H translated to SPARQL without dying of overheads is a necessary preparatory step, but there is little point in doing this in production, since SQL is likely better for this workload anyway and is already known, proven, and deployed.</p>

<p>The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like <a class="auto-href" href="http://dbpedia.org/resource/Rule_Interchange_Format" id="link-id0x1cb916b0">RIF</a> and <a class="auto-href" href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x29dd4a60">OWL</a> is not expressive enough for the real world. As one expert put it, <i>if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases,</i> which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?</p>

<p>Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of your own and competing brands, and in-house emails for assessing organizational response to events on the market.</p>

<p>These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of <a class="auto-href" href="http://dbpedia.org/resource/Datalog" id="link-id0x1cde9dc8">Datalog</a> is the widespread adoption of RDF and <a class="auto-href" href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1d029c80">linked data</a> as a data publishing format, with all the schema-last and <a class="auto-href" href="http://dbpedia.org/resource/Open_world_assumption" id="link-id0x1d81f5b0">open world</a> aspects that have been there from the start.</p>

<p>Stay tuned for more news later this month!</p>

<h3>Related</h3>
<ul>
 <li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1603" id="link-id0x1de6b370">Linked Data and Virtuoso in 2010</a>
 </li>
 <li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1510" id="link-id0x1b031180">Linked Data &amp; The Year 2009</a>
 </li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1286" id="link-id0x1a582d10">Retrospective and Outlook for 2008</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2010-09-21#1631">
  <rss:title>Suggested Extensions to the BSBM</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-21T21:13:39Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Below is a list of possible extensions to the Berlin SPARQL Benchmark. Our previous critique of BSBM consists of: The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF. Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales. An update stream would make the workload more realistic. We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics. So I am publishing the below as a starting point for discussion. BSBM Analytics Mix Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation is between linear and n * log(n) to the data size. The TPC-H rules can be used for a power (single user) and a throughput (multi-user, where each submits queries from the mix with different parameters and in different order). The TPC-H score formula and executive summary formats are directly applicable. This can be a separate metric from the &quot;restricted&quot; BSBM score. Restricted means &quot;without a full scan with regexp&quot; which will dominate the whole metric at larger scales. Vendor specific variations in syntax will occur, hence these are allowed but disclosure of specific query text should accompany results. Hints for JOIN order and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible. The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after backup post initial load is allowed, resulting in semi-warm cache. Steady-state rules will be defined later, seeing the characteristics of the actual workload. 
For each country, list the top 10 product categories, ordered by the count of reviews from the country. Product with the most reviews during its first month on the market 10 products most similar to X, with similarity score based on the count of features in common Top 10 reviewers of category X Product with largest increase in reviews in month X compared to month X-minus-1. Product of category X with largest change in mean price in the last month Most active American reviewer of Japanese cameras last year Correlation of price and average review Features with greatest impact on price: for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature Country with greatest popularity of products in category X: reviews of category X from country Y divided by total reviews Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers Fans of manufacturer: find top reviewers who score manufacturer above their mean score Products sold only in country X BSBM IR Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries. For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload. Q6 from the original mix, now allowing use of text index. Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg. As above, but now also specifying the review author. The intent is that the structured criteria are here more selective than the text. Difference in the frequency of use of &quot;awesome&quot;, &quot;super&quot;, and &quot;suck(s)&quot; by American vs. European vs. 
Asian review authors. Changes to Test Driver For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this. The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries. The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics. Changes to Data Generation For supporting the IR mix, reviews should, in addition to random text, contain the following: For each feature in the product concerned, add the label of said feature to 60% of the reviews. Add the names of review author, product, product category, and manufacturer. The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms. Skew the review scores so that comparatively expensive products have a smaller chance for a bad review. Update Stream During the benchmark run: 1% of products are added; 3% of initial offers are deleted and 3% are added; and 5% of reviews are added. Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed. The initial bulk load does not have to be transactional in any way. Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. 
Queries should run at least in READ COMMITTED isolation, so that half-inserted products or offers are not seen. Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed. The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads. The data generator should generate multiple files for the initial dump in order to facilitate parallel loading. The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Below is a list of possible extensions to the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x236ebfd0">Berlin SPARQL Benchmark</a>. 
Our previous critique of <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x237c3af0">BSBM</a> consists of:</p>
<ol>
 <li>
  <p>The queries touch very little <a href="http://dbpedia.org/resource/Data" id="link-id0x22e140a8">data</a>, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x237ccb50">RDF</a>. </p>
 </li>
<li>
  <p>Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.</p>
</li>
<li>
  <p>An update stream would make the workload more realistic.</p>
</li>
</ol>

<p>We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics.</p>

<p>So I am publishing the below as a starting point for discussion.</p>

<h2>BSBM Analytics Mix</h2>

<p>Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation cost is between linear and <i>n * log(n)</i> in the data size. The <a href="http://www.tpc.org/" id="link-id0x237c81a8">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x2382baf0">H</a> rules can be used for a power metric (single user) and a throughput metric (multi-user, where each user submits queries from the mix with different parameters and in a different order). The TPC-H score formula and executive summary formats are directly applicable.</p>

<p>This can be a separate metric from the &quot;restricted&quot; BSBM score. Restricted means &quot;without a full scan with regexp,&quot; since such a scan would dominate the whole metric at larger scales.</p>

<p>Vendor specific variations in syntax will occur, hence these are allowed but disclosure of specific query text should accompany results. Hints for <code>JOIN</code> order and the like are not allowed; queries must be declarative. We note that both <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x235c9048">SPARQL</a> and <a href="http://dbpedia.org/resource/SQL" id="link-id0x239194b0">SQL</a> implementations of the queries are possible.</p>

<p>The queries are ordered so that the first ones fill the <a href="http://dbpedia.org/resource/Cache" id="link-id0x23845240">cache</a>. Running the analytics mix immediately after the backup that follows the initial load is allowed, resulting in a semi-warm cache. Steady-state rules will be defined later, once the characteristics of the actual workload are known.</p>

<ol>
 <li>
  <p>For each country, list the top 10 product categories, ordered by the count of reviews from the country.</p>
 </li>
<li>
  <p>Product with the most reviews during its first month on the market</p>
</li>
<li>
  <p>10 products most similar to X, with similarity score based on the count of features in common</p>
</li>
<li>
  <p>Top 10 reviewers of category X</p>
</li>
<li>
  <p>Product with largest increase in reviews in month X compared to month X-minus-1.</p>
</li>
<li>
  <p>Product of category X with largest change in mean price in the last month </p>
</li>
<li>
  <p>Most active American reviewer of Japanese cameras last year</p>
</li>
<li>
  <p>Correlation of price and average review</p>
</li>
<li>
  <p>Features with greatest impact on price: for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature</p>
</li>
<li>
  <p>Country with greatest popularity of products in category X: reviews of category X from country Y divided by total reviews</p>
</li>
<li>
  <p>Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers</p>
</li>
<li>
  <p>Fans of manufacturer: find top reviewers who score manufacturer above their mean score</p>
</li>
<li>
  <p>Products sold only in country X</p>
</li>
</ol>

<h2>BSBM IR</h2>

<p>Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.</p>

<p>For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.</p>

<ul>
 <li>
  <p>Q6 from the original mix, now allowing use of text index.</p>
 </li>
<li>
  <p>Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.</p>
</li>
<li>
  <p>As above, but now also specifying the review author. The intent is that the structured criteria are here more selective than the text.</p>
</li>
<li>
  <p>Difference in the frequency of use of &quot;awesome&quot;, &quot;super&quot;, and &quot;suck(s)&quot; by American vs. European vs. Asian review authors.</p>
</li>
</ul>

<h2>Changes to Test Driver</h2>

<p>For full text queries, the search terms have to be selected according to a realistic distribution. <a href="http://dbpedia.org/resource/Digital_Enterprise_Research_Institute" id="link-id0x2391c0a0">DERI</a> has offered to provide a definition and possibly an implementation for this.</p>

<p>The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.</p>

<p>The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.</p>
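
<p>For orientation, here is a minimal sketch of TPC-H-style scoring as it might be carried over to the analytics mix. It follows the general shape of the TPC-H metrics (a geometric mean for the power score, queries per hour for the throughput score, and the geometric mean of the two as the composite) but leaves out the refresh functions that TPC-H includes in the power score. The function names, the use of a scale factor for BSBM, and the timings are assumptions for illustration only.</p>

```python
import math


def power_score(scale_factor, query_seconds):
    """3600 * SF divided by the geometric mean of single-user timings."""
    log_mean = sum(math.log(t) for t in query_seconds) / len(query_seconds)
    return 3600.0 * scale_factor / math.exp(log_mean)


def throughput_score(scale_factor, streams, queries_per_stream, elapsed_seconds):
    """Queries completed per hour across all streams, scaled by SF."""
    return streams * queries_per_stream * 3600.0 / elapsed_seconds * scale_factor


def composite_score(power, throughput):
    """Geometric mean of the power and throughput scores."""
    return math.sqrt(power * throughput)


# Made-up timings in seconds, for illustration only.
power = power_score(scale_factor=1, query_seconds=[2.0, 8.0])
throughput = throughput_score(1, streams=2, queries_per_stream=13, elapsed_seconds=93.6)
print(round(power), round(throughput), round(composite_score(power, throughput)))
```

<p>The geometric mean keeps a single slow query from dominating the power score, which speaks to the earlier critique that the linear queries dominate the metric at larger scales.</p>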

<h2>Changes to Data Generation</h2>

<p>For supporting the IR mix, reviews should, in addition to random text, contain the following:</p>

<ul>
 <li>
  <p>For each feature in the product concerned, add the label of said feature to 60% of the reviews.</p>
 </li>
<li>
  <p>Add the names of review author, product, product category, and manufacturer.</p>
</li>
<li>
  <p>The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms.</p>
</li>
<li>
  <p>Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.</p>
</li>
</ul>
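
<p>The adjective rule in the list above could be prototyped roughly as follows. This is a sketch only: the word lists, the function name, and the reading of the rule as a fixed every-20th-word slot are assumptions, and country-specific idiom lists are left out.</p>

```python
import random

# Hypothetical adjective lists keyed by review score; not from the BSBM generator.
ADJECTIVES = {1: ["sucky", "dismal"], 3: ["good"], 5: ["awesome", "super"]}
FILLER = ["the", "battery", "screen", "works", "with", "this", "product"]


def review_text(score, n_words, rng):
    """Build review text in which every 20th word is an adjective."""
    words = []
    for position in range(1, n_words + 1):
        if position % 20 == 0:
            if rng.random() < 0.8:
                pool = ADJECTIVES[score]  # correlates with the score
            else:
                pool = rng.choice(list(ADJECTIVES.values()))  # random list
            words.append(rng.choice(pool))
        else:
            words.append(rng.choice(FILLER))
    return " ".join(words)


rng = random.Random(42)  # fixed seed so that runs are repeatable
text = review_text(score=5, n_words=100, rng=rng)
print(text)
```

<p>A fixed seed matters for a benchmark generator, since every test sponsor has to be able to regenerate the same data set.</p>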

<h2>Update Stream</h2>

<p>During the benchmark run:</p>

<ul>
 <li>
  <p>1% of products are added;</p>
 </li>
<li>
  <p>3% of initial offers are deleted and 3% are added; and </p>
</li>
<li>
  <p>5% of reviews are added.</p>
</li>
</ul>
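
<p>To make the percentages above concrete, the sketch below turns them into operation counts. The data set sizes are hypothetical, and applying each percentage to the initial count of the same entity is my reading of the list.</p>

```python
def update_stream_counts(products, offers, reviews):
    """Operations implied by the percentages in the list above."""
    return {
        "products_added": products * 1 // 100,
        "offers_deleted": offers * 3 // 100,
        "offers_added": offers * 3 // 100,
        "reviews_added": reviews * 5 // 100,
    }


# Hypothetical data set sizes, chosen only to make the arithmetic visible.
counts = update_stream_counts(products=10_000, offers=200_000, reviews=100_000)
print(counts)
```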

<p>Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.</p>

<p>The initial bulk load does not have to be transactional in any way.</p>

<p>Loading the update stream must be transactional, guaranteeing that all <a href="http://dbpedia.org/resource/Information" id="link-id0x237aa408">information</a> pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in <code>READ COMMITTED</code> isolation, so that half-inserted products or offers are not seen.</p>

<p>Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.</p>

<p>The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.</p>
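
<p>A loader for such files might group lines into transactions along these lines. The marker text is hypothetical, since the proposal only says that boundaries are marked with a distinguishable comment, and the sample triples are made up.</p>

```python
# Hypothetical marker; the text only says boundaries are marked by a comment.
BOUNDARY = "# TRANSACTION BOUNDARY"


def split_transactions(lines):
    """Group consecutive Turtle lines into transactions at boundary comments."""
    transactions, current = [], []
    for line in lines:
        if line.strip() == BOUNDARY:
            if current:
                transactions.append(current)
            current = []
        elif line.strip():
            current.append(line.rstrip("\n"))
    if current:
        transactions.append(current)
    return transactions


sample = [
    "ex:offer1 ex:product ex:product7 .",
    "ex:offer1 ex:price 19 .",
    BOUNDARY,
    "ex:offer2 ex:product ex:product9 .",
]
batches = split_transactions(sample)
print(len(batches))
```

<p>Since the boundaries are only possible commit points, a loader is free to merge several adjacent batches into one transaction.</p>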

<p>The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.</p>

<p>The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2010-09-21#1630">
  <rss:title>LOD2 Kick Off</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-21T21:13:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The LOD2 kick off meeting was held in Leipzig on Sept 6-8. I will here talk about OpenLink plans as concerns LOD2; hence this is not to be taken as representative of the whole project. I will first discuss the immediate and conclude with the long term. As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and RDF benchmarks in February. The LOD2 repository is a fusion of the OpenLink LOD Cloud Cache (which includes data from URIBurner and PingTheSemanticWeb) and Sindice, both hosted at DERI. The value-add compared to Sindice or the Virtuoso-based LOD Cloud Cache alone is the merger of the timeliness and ping-ping crawling of Sindice with the SPARQL of Virtuoso. Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to much better working set, as data is many times more compact than with the present row-wise key compression. Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage. As for benchmarks, I just compiled a draft of suggested extensions to the BSBM (Berlin SPARQL Benchmark). I talked about this with Peter Boncz and Chris Bizer, to the effect that some extensions of BSBM could be done but that the time was a bit short for making a RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational schema and that RDF offers no fundamental edge for the workload. There was a graph benchmark talk at the TPC workshop at VLDB 2010. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. 
The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set but LOD2 will in time publish results from such. We did informally talk about a process for publishing with our colleagues from Franz and Ontotext at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware. Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-H in SQL and SPARQL. The SQL will be Virtuoso, MonetDB, and possibly VectorWise and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end. In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it. LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at CWI to RDF use cases. This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion. LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, Greenplum, or Vertica can do map-reduce style operations under control of the SQL engine. 
After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the Berkeley Orders Of Magnitude (BOOM) distributed Datalog implementation (Overlog, Dedalus, BLOOM) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed. From our viewpoint, the project&#39;s gains include: Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility. Ready to use toolbox for data integration, including schema alignment and resolution of coreference. Data discovery, summarization and visualization Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and linked data. In this respect the integration of results may be stronger than often seen in European large scale integrating projects. The use cases fit the development profile well: Wolters Kluwer will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology. Exalead will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources. The Open Knowledge Foundation will create a portal of all government published data for easy access by citizens. In all these cases, the integration requirements of schema alignment, resolution of identity, information extraction, and efficient storage and retrieval play a significant role. 
The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The <a href="http://lod2.eu/" id="link-id0x236f6368">LOD2</a> <a href="http://lod2.eu/BlogPost/9-press-release-lod2-project-launch.html" id="link-id0x18c0c770">kick off meeting</a> was held in Leipzig on Sept 6-8. Here I will talk about OpenLink&#39;s plans for LOD2; this should not be taken as representative of the whole project. I will first discuss the immediate term and conclude with the long term.</p>

<p>As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x22e46c78">RDF</a> benchmarks in February.</p>

<p>The LOD2 repository is a fusion of the OpenLink <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x235aaf00">LOD</a> <a href="http://lod.openlinksw.com/" id="link-id0x237a3470">Cloud</a> <a href="http://dbpedia.org/resource/Cache" id="link-id0x237d6380">Cache</a> (which includes <a href="http://dbpedia.org/resource/Data" id="link-id0x236d3830">data</a> from <a href="http://uriburner.com/" id="link-id0x22ff6f88">URIBurner</a> and <a href="http://www.pingthesemanticweb.com/" id="link-id0x235a3be8">PingTheSemanticWeb</a>) and <a href="http://sindice.com/" id="link-id0x23783d68">Sindice</a>, both hosted at <a href="http://dbpedia.org/resource/Digital_Enterprise_Research_Institute" id="link-id0x22e48ff8">DERI</a>. The value-add compared to Sindice or the <a href="http://virtuoso.openlinksw.com" id="link-id0x23904730">Virtuoso</a>-based LOD Cloud Cache alone is the merger of the timeliness and ping-ping crawling of Sindice with the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x2382c800">SPARQL</a> of Virtuoso.</p>

<p>Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to a much better working set, as data is many times more compact than with the present row-wise <a href="http://dbpedia.org/resource/Data_compression" id="link-id0x236e64d0">key compression</a>.</p>

<p>Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage.</p>

<p>As for benchmarks, I just compiled <a href="http://www.openlinksw.com/weblogs/oerling/" id="link-id0x1c29e720">a draft of suggested extensions to the BSBM</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2391d000">Berlin SPARQL Benchmark</a>). I talked about this with <a href="http://nl.linkedin.com/in/peterboncz" id="link-id0x22e4efe8">Peter Boncz</a> and <a href="http://data.semanticweb.org/person/christian-bizer" id="link-id0x23910980">Chris Bizer</a>, to the effect that some extensions of BSBM could be done but that the time was a bit short for making an RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x236fd5c8">schema</a> and that RDF offers no fundamental edge for the workload.</p>

<p>There was a graph benchmark talk at the <a href="http://www.tpc.org/" id="link-id0x236eaa88">TPC</a> workshop at <a href="http://www.vldb2010.org/" id="link-id0x2391e818">VLDB 2010</a>. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set but LOD2 will in time publish results from such a benchmark.</p>

<p>We did informally talk about a process for publishing with our colleagues from <a href="http://semanticweb.org/id/Franz_Inc" id="link-id0x235a4c40">Franz</a> and <a href="http://dbpedia.org/resource/Ontotext" id="link-id0x236ec978">Ontotext</a> at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware.</p>

<p>Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x23819e80">H</a> in <a href="http://dbpedia.org/resource/SQL" id="link-id0x2382b890">SQL</a> and SPARQL. The SQL will be Virtuoso, <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x2382b8b8">MonetDB</a>, and possibly <a href="http://www.ingres.com/vectorwise/" id="link-id0x237c2b10">VectorWise</a> and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.</p>

<p>In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.</p>

<p>LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at <a href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x2381d9a0">CWI</a> to RDF use cases.</p>

<p>This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, specifically queries with Datalog-like rules and recursion.</p>

<p>LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x22e1f068">Greenplum</a>, or <a href="http://www.vertica.com/" id="link-id0x23905c58">Vertica</a> can do map-reduce style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x23905c80">Berkeley Orders Of Magnitude</a> (<a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x234bc1f8">BOOM</a>) distributed Datalog implementation (Overlog, Dedalus, Bloom) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.</p>

<p>From our viewpoint, the project&#39;s gains include:</p>

<ul>
 <li>
  <p>Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.</p>
 </li>
<li>
  <p>Ready to use toolbox for data integration, including schema alignment and resolution of coreference.</p>
</li>
<li>
  <p>Data discovery, summarization and visualization.</p>
</li>
</ul>

<p>Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2390bdf8">linked data</a>. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.</p>

<p>The use cases fit the development profile well: </p>
<ul>
 <li>
  <p>
    <a href="http://dbpedia.org/resource/Wolters_Kluwer" id="link-id0x237a3420">Wolters Kluwer</a> will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.</p>
 </li>
<li>
  <p>
    <a href="http://dbpedia.org/resource/Exalead" id="link-id0x235c5d70">Exalead</a> will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.</p>
</li>
<li>
  <p>The Open <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x235f8138">Knowledge</a> Foundation will create a portal of all government published data for easy access by citizens.</p>
</li>
</ul>

<p>In all these cases, the integration requirements of schema alignment, resolution of identity, <a href="http://dbpedia.org/resource/Information" id="link-id0x22fef2e8">information</a> extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2010-03-15#1614">
  <rss:title>SemData@Sofia Roundtable write-up</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-03-15T14:46:57Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">There was last week an invitation-based roundtable about semantic data management in Sofia, Bulgaria. Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis. Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them? I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences? Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.) OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel. 
I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF&#39;s run time typing needs. So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community. After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same. So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what&#39;s been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns. Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need. Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. 
Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself. I say that there is nothing wrong with materialization if it is guided by need, in the vein of memoization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again. Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing à la MonetDB, this too integrates and applies to inference results. So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans. Aside from inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today. What do we need for this? We need close-to-parity with relational; doing your warehouse in RDF with the attendant agility thereof can&#39;t cost 10x more to deploy than the equivalent relational solution. We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID, at least consistent read, availability, complex query, large scale. And to do this, we need a benchmark. 
It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Last week there was an <a href="http://www.semdata.org/" id="link-id11a83cf98">invitation-based roundtable</a> about semantic <a href="http://dbpedia.org/resource/Data" id="link-id0x1d5ae638">data</a> management in <a href="http://www.dbpedia.org/resource/Sofia" id="link-id0x1c147340">Sofia, Bulgaria</a>.</p>

<p>Lots of smart people together. The meeting was hosted by <a href="http://dbpedia.org/resource/Ontotext" id="link-id0x1c77a6e8">Ontotext</a> and chaired by <a href="http://www.dbpedia.org/resource/Dieter_Fensel" id="link-id0x1e64f350">Dieter Fensel</a>. On the database side we had Ontotext, <a href="http://www.systap.com/" id="link-id0x1cc261c8">SYSTAP</a> (<a href="http://www.systap.com/bigdata.htm" id="link-id0x1dad5348">Bigdata</a>), <a href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x1d3b68f8">CWI</a> (<a href="http://dbpedia.org/resource/MonetDB" id="link-id0x1dba4028">MonetDB</a>), <a href="http://www.dbpedia.org/resource/Karlsruhe_Institute_of_Technology" id="link-id0x1a01f668">Karlsruhe Institute of Technology</a> (YARS2/<a href="http://swse.deri.ie/" id="link-id0x1ceeed50">SWSE</a>). <a href="http://www.larkc.eu/" id="link-id0x1e650c98">LarKC</a> was well represented, being our hosts, with STI, Ontotext, CYC, and <a href="http://www.vu.nl/" id="link-id0x1ca044f0">VU Amsterdam</a>. Notable absences were <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1b9e5418">Oracle</a>, <a href="http://freebase.com/guid/9202a8c04000641f8000000005c908d6" id="link-id0x1e55adc8">Garlik</a>, <a href="http://semanticweb.org/id/Franz_Inc" id="link-id0x1cf1d4b8">Franz</a>, and <a href="http://www.talis.com/" id="link-id0x1cbb8740">Talis</a>.</p>

<p>Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?</p>

<p>Last fall I had a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?</p>

<p>Michael <a href="http://dbpedia.org/resource/Michael_Stonebraker" id="link-id0x1e7b3080">Stonebraker</a> and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1dba61f8">RDF</a> storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)</p>

<p>
<a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id0x1cf45d00">OpenLink Software</a> and <a href="http://virtuoso.openlinksw.com" id="link-id0x1d8b3ac0">Virtuoso</a> are in agreement with both sides, contradictory as this might sound. We took our <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1d51e110">RDBMS</a> and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.</p>

<p>I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF&#39;s run time typing needs.</p>

<p>So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.</p>

<p>After the initial meeting at CWI, I tried to figure out what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.</p>

<p>So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what&#39;s been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.</p>

<p>Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is; nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.</p>

<p>Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.</p>

<p>I say that there is nothing wrong with materialization if it is guided by need, in the vein of memoization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.</p>

<p>Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you <i>will</i> have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing <i>à la</i> MonetDB, this too integrates and applies to inference results.</p>


<p>So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.</p>


<p>Aside from inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.</p>

<p>What do we need for this? We need close-to-parity with relational; doing your warehouse in RDF with the attendant agility thereof can&#39;t cost 10x more to deploy than the equivalent relational solution.</p>

<p>We also want to tell the key-value, anti-<a href="http://dbpedia.org/resource/SQL" id="link-id0x1cbeaf70">SQL</a> people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of <a href="http://dbpedia.org/resource/ACID" id="link-id0x1e11fbd0">ACID</a>, at least consistent read, availability, complex query, large scale.</p>

<p>And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a <a href="http://www.tpc.org/" id="link-id0x1a950de0">TPC</a>-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold. </p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-10-27#1585">
  <rss:title>European Commission and the Data Overflow</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-10-27T18:29:51Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data. Since the questionnaire is public, I am publishing my answers below. Data and data types What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news. This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema. Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like. The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is especially so if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction. Relevant sections of this mass of data are a potential addition to any present or future analytics application. 
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth. Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models. By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas. What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)? All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know. Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred. Interleaving of all database functions and types becomes increasingly important. Industries, communities Who is producing these data and why? Could they do it better? How? Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. 
These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs). Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread. Who is consuming these data and why? Could they do it better? How? Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird&#39;s eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages. Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. 
We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean. What industrial sectors in Europe could become more competitive if they became much better at managing data? Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media. Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support? The regulation landscape drives database demand through data retention requirements and the like. With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science. For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help. More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile. This is a tall order and implementing something of the sort is an open question. What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers. We have come across the following: Knowing that the data exists in the first place. If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like. 
Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data. Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this. Services, software stacks, protocols, standards, benchmarks What combinations of components are needed to deal with these problems? Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction. Interoperability and at least de facto standards in ways of doing this will emerge. What data exchange and processing mechanisms will be needed to work across platforms and programming languages? HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient. There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF. For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate. 
What data environments are today so wastefully messy that they would benefit from the development of standards? RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together. Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish. Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example. What kind of performance is expected or required of these systems? Who will measure it reliably? How? Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases. The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need. These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing. We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached. The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction. Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. 
But the TPC-H rules do not allow official reporting of such. Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs. If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high end TPC-C results), then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish. Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction. Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding. Usability and training How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier? In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL. Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned. There is no single solution. 
For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches. Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea. For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities. For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming. This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so. One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more. This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. 
These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer. How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries? For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors. RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these. A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries. For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize. The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable. As the data overflow proceeds, its victims will multiply and create demand for solutions. 
The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off. If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination. The problem is creating the pull. Challenges What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ? The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better; reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges. Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, specially if the race is tight. The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact. The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain. 
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed. What should one do to set up such a challenge, administer, and monitor it? The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed. The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&amp;D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US. There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress. Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big <a href="http://dbpedia.org/resource/Data" id="link-id0x79cfe58">data</a>.</p>
 
<p>Since the <a href="http://cordis.europa.eu/fp7/ict/content-knowledge/consultation_en.html" id="link-id1191c0f8">questionnaire is public</a>, I am publishing my answers below.</p>

<ol type="1" start="1">
<li>
  <p>
    <b>Data and data types</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? </b>
    </p>

<p>Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional.  This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x7d7e7a0">RDF</a> and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x7f2a788">linked data</a> principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data.  There is convergence around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x7dfbca8">DBpedia</a> identifiers for real-world entities, e.g., most things that would be in the news.</p>

<p>This also means that internal data processes and silos may be enriched with this content.  There is consequent pressure for accommodating more diversity of data, with more flexible <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x7babaf8">schema</a>.</p>

<p>Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data.  Examples are product catalogs, price lists, event schedules, and the like.</p>

<p>The volume of the well-known linked data sets is around 10 billion statements.  With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable.  This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.</p>

<p>Relevant sections of this mass of data are a potential addition to any present or future analytics application.</p>

<p>Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data.  This will drive database innovation for the next years even more than the continued classical warehouse growth.</p>

<p>Science data is another driver of the data overflow.  For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data.  This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data.  Data and <a href="http://dbpedia.org/resource/Metadata" id="link-id0x96ce60">metadata</a> should travel together but may have different data models.</p>

<p>By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible.  Restricted circles can and likely will implement similar ideas.</p>
    </li>

<li>
    <p>
        <b>What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x7e8e248">knowledge</a> graphs, 3D, sensor streams...)?</b>
    </p>

<p>All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., <i>photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.</i>
      </p>

<p>Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.</p>

<p>Interleaving of all database functions and types becomes increasingly important.</p>
</li>
  </ol>
</li>


<li>
  <p>
    <b>Industries, communities</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>Who is producing these data and why? Could they do it better? How?</b>
    </p>

<p>Right now, projects such as <a href="http://www.bio2rdf.org/" id="link-id0x43bd098">Bio2RDF</a>, <a href="http://neurocommons.org/page/Main_Page" id="link-id0x5c074b0">Neurocommons</a>, and DBpedia produce this data.  The processes are in place and are reasonable.  Incremental improvement is to be expected.  These processes, along with the <a href="http://www.w3.org/DesignIssues/LinkedData.html" id="link-id0x72131d0">linked data meme</a> generally taking off, drive demand for better <a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x71e7798">NLP</a> (<a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x7e0e2f0">Natural Language Processing</a>), e.g., <a href="http://dbpedia.org/resource/Entity" id="link-id0x71ab500">entity</a> and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).</p>

<p>Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this.  The required baseline level has been reached; the rest is a matter of automating deployment.  Within the enterprise, there are advantages to be gained for <a href="http://dbpedia.org/resource/Information" id="link-id0x7a8e9a8">information</a> integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x599f630">URI</a>.  Some of this information may even be published on an <a href="http://dbpedia.org/resource/Extranet" id="link-id0x2a28f98">extranet</a> for self-service and web-service interfaces.  This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier.  Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.</p>
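
<p>As a rough illustration of what &quot;giving everything a URI&quot; can look like, the sketch below turns rows from a hypothetical customer table and a hypothetical support-ticket table into subject-predicate-object triples that share identifiers.  The base URI, the table names, and the column names are all invented for the example, and this is not the mapping language being standardized at the W3C; it is only a minimal direct mapping in plain Python.</p>

<pre>
```python
# Minimal sketch: map relational rows to RDF-style triples.
# BASE, the table names, and the column names are hypothetical.
BASE = "http://example.com/crm/"


def row_to_triples(table, key, row, links=None):
    """Return (subject, predicate, object) tuples for one row.

    links maps a foreign-key column to the table it points to, so that
    the object becomes a URI instead of a literal value.
    """
    links = links or {}
    subject = f"{BASE}{table}/{row[key]}"
    triples = []
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is already in the subject URI; skip NULLs
        predicate = f"{BASE}schema/{table}#{column}"
        if column in links:
            value = f"{BASE}{links[column]}/{value}"
        triples.append((subject, predicate, value))
    return triples


customer = {"id": 42, "name": "ACME"}
ticket = {"id": 7, "customer_id": 42, "status": "open"}

triples = row_to_triples("customer", "id", customer)
triples += row_to_triples("ticket", "id", ticket, links={"customer_id": "customer"})

for triple in triples:
    print(triple)
```
</pre>

<p>Because the ticket now points at the same URI that identifies the customer, the two formerly separate record sets can be joined without knowing either table layout in advance.</p>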

</li>
	<li>
    <p>
        <b>Who is consuming these data and why? Could they do it better? How?</b>
    </p>

<p>Consumers are various.  The greatest need is for tools that summarize complex data and allow getting a bird&#39;s eye view of what data is in the first instance available.  Consuming the data is hindered by the user not even necessarily knowing what data there is.  This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with <a href="http://dbpedia.org/resource/SQL" id="link-id0x5999558">SQL</a> report generators and statistics packages.</p>

<p>Where Web 2.0 made the <i>citizen journalist</i>, the web of linked data will make the <i>citizen analyst</i>.  For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful.  We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.</p>

</li>
	<li>
    <p>
        <b>What industrial sectors in Europe could become more competitive if they became much better at managing data?</b>
    </p>

<p>Any sector could benefit.  Early adopters are seen in the biomedical field and to an extent in media.  </p>

</li>
	<li>
    <p>
        <b>Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support?</b>
    </p>

<p>The regulation landscape drives database demand through data retention requirements and the like.</p>

<p>With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online.   Regulation is needed to protect individuals, but integration should still be possible for science.</p>

<p>For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF.  This is possible but needs some more work.  Also, creating on-the-fly-anonymizing views on data might help.</p>
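
<p>To make the idea concrete, here is a small sketch of a policy applied per statement rather than per table row, combined with an anonymizing view computed on the fly.  The roles, the predicates, and the policy table are hypothetical, and a truncated unsalted hash is used only to keep the example short; it would not count as real anonymization.</p>

<pre>
```python
import hashlib

# Hypothetical policy: the roles allowed to see each predicate.
VISIBLE = {
    "diagnosis": {"clinician", "researcher"},
    "name": {"clinician"},
    "age": {"clinician", "researcher"},
}


def pseudonym(subject):
    """Replace an identifier with a stable token (illustration only)."""
    digest = hashlib.sha256(subject.encode("utf-8")).hexdigest()
    return "anon:" + digest[:12]


def view(triples, role):
    """Yield only the triples the role may see, pseudonymizing subjects
    for every role except the hypothetical 'clinician'."""
    for subject, predicate, obj in triples:
        if role not in VISIBLE.get(predicate, set()):
            continue  # predicate is not visible to this role
        if role != "clinician":
            subject = pseudonym(subject)
        yield (subject, predicate, obj)


data = [
    ("patient:1", "name", "Alice"),
    ("patient:1", "diagnosis", "asthma"),
    ("patient:1", "age", 34),
]

for triple in view(data, "researcher"):
    print(triple)
```
</pre>

<p>The same stored statements thus yield different answers depending on who asks, without a separate copy of the data per audience.</p>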

<p>More research is needed for reconciling the need for security with the advantages of broad-based <i>ad hoc</i> integration.  Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile.  This is a tall order and implementing something of the sort is an open question.</p>


</li>
	<li>
    <p>
        <b>What are the main practical problems identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.</b>
    </p>

<p>We have come across the following:</p>

<ul>
        <li>Knowing that the data exists in the first place.</li>
<li>If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.</li>
<li>Compatible subject matter but incompatible representation:  For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument.  It is only to be expected that the time interval between measurements is not the same.  So there is need for a lot of one-off programming to align data.</li>
      </ul>
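
<p>The last item is the kind of one-off alignment code meant here.  The sketch below resamples one time series onto the timestamps of another by linear interpolation; the sample values and the choice of linear interpolation are assumptions made for the example, not a recommendation for any particular data set.</p>

<pre>
```python
from bisect import bisect_left


def interpolate(series, t):
    """Linearly interpolate a time-sorted [(time, value), ...] list at t.

    Returns None when t falls outside the measured range.
    """
    times = [time for time, _ in series]
    i = bisect_left(times, t)
    if i != len(times) and times[i] == t:
        return series[i][1]  # exact hit, no interpolation needed
    if i == 0 or i == len(times):
        return None  # before the first or after the last measurement
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(a, b):
    """Resample series b onto the timestamps of series a."""
    rows = []
    for t, value_a in a:
        value_b = interpolate(b, t)
        if value_b is not None:
            rows.append((t, value_a, value_b))
    return rows


# Hypothetical data: times in minutes; a is sampled every 10, b every 15.
a = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 4.0)]
b = [(0, 10.0), (15, 40.0), (30, 70.0)]

for row in align(a, b):
    print(row)
```
</pre>

<p>Even this toy case has to decide what to do outside the measured range and which series sets the time grid, which is why such alignment tends to be rewritten for each pair of sources.</p>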

<p>Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network.  Computation needs to go to the data, and databases need to support this.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Services, software stacks, protocols, standards, benchmarks</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What combinations of components are needed to deal with these problems?</b>
    </p>

<p>Recent times have seen a proliferation of special purpose databases.  Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility.  We see some of this in integration of map-reduce and scale-out databases.  The former antagonists have become partners. Vertica, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x45ecfa0">Greenplum</a>, and OpenLink <a href="http://virtuoso.openlinksw.com" id="link-id0x7f73fc8">Virtuoso</a> are examples of DBMSs featuring work in this direction.</p>

<p>Interoperability and at least <i>de facto</i> standards in ways of doing this will emerge.</p>

</li>
	<li>
    <p>
        <b>What data exchange and processing mechanisms will be needed to work across platforms and programming languages?</b>
    </p>

<p>
        <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x776a1a0">HTTP</a>, <a href="http://dbpedia.org/resource/XML" id="link-id0x2a4e8d0">XML</a>, and RDF are in fact very verbose, yet these are the formats and models that have uptake.  Thus, these will continue to be used even though one might think binary formats to be more efficient.</p>

<p>There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.</p>

<p>For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue.  Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.</p>


</li>
	<li>
    <p>
        <b>What data environments are today so wastefully messy that they would benefit from the development of standards?</b>
    </p>


<p>RDF and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x2a35960">OWL</a> are not messy but they could use some more performance; we are working on this.  <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x12362e8">SPARQL</a> is finally acquiring the capabilities of a serious query language, so things are slowly coming together.</p>

<p>Community process for developing application domain specific vocabularies works quite well, even though one could argue it is <i>ad hoc</i> and not up to what a modeling purist might wish.</p>

<p>Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS* standards, for example.</p>

</li>
	<li>
    <p>
        <b>What kind of performance is expected or required of these systems? Who will measure it reliably? How?</b>
    </p>

<p>Relational databases have a history of substantial investment in <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0x7b2d7c8">optimization</a> and some of them are very good for what they do, e.g., the newer generation of analytics databases.</p>

<p>The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.</p>

<p>These trends will merge:  Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.</p>

<p>We find RDF augmented with some binary types at this crossroads.  This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model.  The added cost of schema-last and inference must come down.  We are working on this.  Research work such as carried out with <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x794ee48">MonetDB</a> gives clues as to how these aims can be reached.</p>

<p>The separation of query language and inference is artificial.  After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.</p>

<p>Benchmarks are key.  Some gain can be had even from repurposing standard relational benchmarks like <a href="http://www.tpc.org/" id="link-id0x7d45c58">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x45b0198">H</a>.  But the TPC-H rules do not allow official reporting of such.</p>

<p>Development of benchmarks for RDF, complex queries, and inference is needed.  A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity.  A key-value store benchmark might also be conceived.  A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.</p>

<p>If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high end TPC-C results), then TPC-style rules and processes would be quite adequate.  The threshold to publish should be lowered:  Everybody runs the TPC workloads internally but few publish.</p>

<p>Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government.  Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.</p>

<p>Benchmarks should be run by software vendors on their own systems, tuned by themselves.  But there should be a process of disclosure and auditing; the TPC rules give an example.  Compliance should not be too expensive or time consuming.  Some community development for automating these things would be a worthwhile target for EC funding.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Usability and training</b>
  </p>

<ol type="a" start="1">

	<li>
    <p>
        <b>How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?</b>
    </p>

<p>In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL.  For the linked data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x7e01618">web</a>, the same will take place behind SPARQL.</p>

<p>Beyond these, things get harder: for example, programming an arbitrary algorithm with MPI so that it makes good use of a cluster platform is quite difficult.  The casual amateur is hereby warned.</p>

<p>There is no single solution.  For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.</p>

<p>Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.</p>

<p>For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.</p>

<p>For shipping functions in a cluster or cloud, the <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x43665a8">BOOM</a> (<a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x7718f00">Berkeley Orders Of Magnitude</a>) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce.  The question is whether a <a href="http://dbpedia.org/resource/PHP" id="link-id0x7d64f68">PHP</a> developer can be made to do logic programming.</p>

<p>This bridge will be crossed only with actual need and even then reluctantly.  We may look at the Web 2.0 practice of sharding <a href="http://dbpedia.org/resource/MySQL" id="link-id0xbab1ae98">MySQL</a>, inconvenient as this may be, for an example.  There is inertia, and thus re-architecting is a constant process that is generally in reaction to facts, <i>post hoc</i>, and often a point solution.  One could argue that planning ahead would be smarter, but by and large the world does not work that way.</p>

<p>One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map-reduce.  If such a thing is inexpensive enough and syntax-level-compatible with the present installed base, many developers do not have to learn very much more.</p>

<p>This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this.  Therefore we wish to go for bold new application types for which the client-server database application is not the model.  Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there.  These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.</p>

</li>
	<li>
    <p>
        <b>How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?</b>
    </p>

<p>For the most part, developers do not learn things for the sake of learning.  When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction.  The research world is often similarly insular.  A new inflection in the application landscape is needed to drive learning.  This inflection is provided by the <a href="https://wiki.mozilla.org/Labs/Ubiquity" id="link-id0x770df38">ubiquity</a> of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.</p>

<p>RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML.  These new things should, within possibility, be deployed in the usual technology stack, <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29" id="link-id0x55596a8">LAMP</a> or Java.  Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.</p>

<p>A lot of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x3d5378">semantic web</a> potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.</p>

<p>For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.</p>

<p>The question is one of providing challenges.  Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training.  With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.</p>

<p>As the data overflow proceeds, its victims will multiply and create demand for solutions.  The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.</p>

<p>If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT.  This would create interest, and interest would drive training and dissemination.</p>

<p>The problem is creating the pull.</p>
</li>
  </ol>
</li>

<li>
  <p>
    <b>Challenges</b>
  </p>
<ol type="a" start="1">

	<li>
    <p>
        <b>What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, <a href="http://dbpedia.org/resource/Google" id="link-id0x6a6c2b0">Google</a> Lunar X Prize, etc. ... ?</b>
    </p>

<p>The EC itself no doubt suffers from data overflow in one function or another.  Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start.  The more real the data, the better; reality is consistently more complex and surprising than imagination.  Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.</p>

<p>Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.</p>

<p>The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.</p>

<p>The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded.  Otherwise investing in existing business development will be more interesting to industry.  Some industry participation seems necessary; we would wish academia and industry to work more closely together.  Also, having industry supply the baseline guarantees that academia actually does further the state of the art.  This is not always certain.</p>

<p>If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia.  Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.</p>


</li>
	<li>
    <p>
        <b>What should one do to set up such a challenge, administer, and monitor it?</b>
    </p>

<p>The EC should probably circulate a call for actual problem scenarios involving big data.  If the matter of the overflow is as dire as represented, cases should be easy to find.  A few should be selected and then anonymized if needed.</p>

<p>The party with the use case would benefit by having, one hopes, the best teams work on it.  The contestants would benefit from having real world needs guide R&amp;D.  The EC would not have to do very much, except possibly use some money for funding the best proposals.  The winner would possibly get a large account and related sales and service income.  The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.</p>

<p>There may be a good benchmark at the time, possibly resulting from FP7 itself.  In such a case, the EC could offer a prize for winners.  Details would have to be worked out case by case.  Such a challenge could be repeated a few times, as benchmark-driven progress, in databases or in TREC for example, has taken some years to reach a point of slowdown.</p>

<p>Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</p>

</li>
  </ol>
</li>
</ol>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-09-01#1576">
  <rss:title>VLDB 2009 TPC Workshop (3 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T15:51:09Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Stonebraker gave the keynote at the TPC workshop. His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself. From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention. Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding optimization, as has pretty much everybody else. It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of data, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication. Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when can. They want arrays for physics, and graphs for biology and chemistry. MapReduce is eating database&#39;s lunch; what will you do about this? I later suggested incorporating an RDF metadata benchmark into the TPC suite. We&#39;ll see about this; we&#39;ll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover. TPC&#39;s own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. 
When the solution to this problem becomes commonplace, the benchmark&#39;s relevance gradually drops. There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones. Also contrasting of cache fusion and partitioning. We have our own data and experience but we find we don&#39;t have time to measure all the other systems. Anyway it is good to raise the question of smooth and predictable performance.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Michael <a href="http://dbpedia.org/resource/Michael_Stonebraker" id="link-id0x15e5efe0">Stonebraker</a> gave the keynote at the <a href="http://www.tpc.org/" id="link-id0x18cee5f0">TPC</a> workshop.  His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself.  From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention.</p>

<p>Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0x4e55368">optimization</a>, as has pretty much everybody else.</p>

<p>It is true that the rules encourage unrealistic configurations.  The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of <a href="http://dbpedia.org/resource/Data" id="link-id0x191cd880">data</a>, just so there are enough disk arms in parallel.  Stonebraker also pointed  out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long.  Benchmarks should therefore include replication.</p>

<p>Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites.  Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases.  They avoid them when they can.  They want arrays for physics, and graphs for biology and chemistry.  <a href="http://dbpedia.org/resource/MapReduce" id="link-id0x53f6040">MapReduce</a> is eating database&#39;s lunch; what will you do about this?</p>

<p>I later suggested incorporating an <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x18902070">RDF</a> <a href="http://dbpedia.org/resource/Metadata" id="link-id0x3990af8">metadata</a> benchmark into the TPC suite.  We&#39;ll see about this; we&#39;ll first have to come up with a suitable one.  There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.</p>

<p>TPC&#39;s own talk was about the life cycle of benchmarks.  A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon.  When the solution to this problem becomes commonplace, the benchmark&#39;s relevance gradually drops.</p>

<p>There was a talk on robustness of query plans which was well to the point.  Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins.  Quite so.  The analysis I really would have liked to see would have been one of what happens when passing from a single server to a cluster, and from local joins to cross-partition ones.  A comparison of <a href="http://dbpedia.org/resource/Cache" id="link-id0x1942aca8">cache</a> fusion and partitioning would also have been welcome.  We have our own data and experience but we find we don&#39;t have time to measure all the other systems.</p>

<p>Anyway it is good to raise the question of smooth and predictable performance.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-06-29#1562">
  <rss:title>Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-06-29T16:12:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest Virtuoso. The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes. The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy. The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made. We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>LUBM load speed still seems to be a metric that is quoted in comparisons of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id142df6e8">RDF</a> stores. Consequently, we too measured the load time of LUBM 8000, 1,068-million triples, on the newest <a href="http://virtuoso.openlinksw.com" id="link-id1389dfa0">Virtuoso</a>.</p>
 
<p>The real time for the load was 161m 3s.  The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM.  The software was Virtuoso 6 Cluster, configured into 8 partitions (processes) — one partition per CPU core.  Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.</p>
 
<p>The load was done on 8 streams, one per server process.   At the beginning of the load, the CPU usage was 740% with no disk; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.</p>
 
<p>The RDF store was configured with the default two indices over quads, these being GSPO and OGPS.  Text indexing of literals was not enabled.  No materialization of entailed triples was made.</p>
 
<p>We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504">
  <rss:title>Virtuoso RDF:  A Getting Started Guide for the Developer</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-17T12:31:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated. The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces. You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files. To get started, cd into binsrc/tests/lubm. To verify that this works, you can do ./test_server.sh virtuoso-t This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step. Loading the Data The file lubm-load.sql contains the commands for loading the LUBM single university qualification database. The data files themselves are in lubm_8000, 15 files in RDFXML. There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark. So now let&#39;s go through this procedure. Start the server: $ virtuoso-t -f &amp; This starts the server in foreground mode, and puts it in the background of the shell. Now we connect to it with the isql utility. $ isql 1111 dba dba This gives a SQL&gt; prompt. The default username and password are both dba. When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword. The lubm-load.sql file is quite self-explanatory. 
It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory. Next it calls this function for the lubm_8000 directory under the server&#39;s working directory. sparql CLEAR GRAPH &lt;lubm&gt;; sparql CLEAR GRAPH &lt;inf&gt;; load_lubm ( server_root() || &#39;/lubm_8000/&#39; ); Then it verifies that the right number of triples is found in the &lt;lubm&gt; graph. sparql SELECT COUNT(*) FROM &lt;lubm&gt; WHERE { ?x ?y ?z } ; The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now. Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database. sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; INSERT INTO GRAPH &lt;lubm&gt; { ?x ub:subOrganizationOf ?z } FROM &lt;lubm&gt; WHERE { ?x ub:subOrganizationOf ?y . ?y ub:subOrganizationOf ?z . }; Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph. DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, &#39;inf&#39; ) ; sparql SELECT COUNT(*) FROM &lt;inf&gt; WHERE { ?x ?y ?z } ; Then we declare that the triples in the &lt;inf&gt; graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the &#39;inft&#39; rule set. Otherwise this has no effect. rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;); This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time. checkpoint; Now we are ready for querying. Querying the Data The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. 
The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the &lt;inf&gt; graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the &lt;lubm&gt; graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file. If you wish to run all the commands in a SQL file, you can type load &lt;filename&gt;; (e.g., load lubm-cp.sql;) at the SQL&gt; prompt. If you wish to try individual statements, you can paste them to the command line. For example: SQL&gt; sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; SELECT * FROM &lt;lubm&gt; WHERE { ?x a ub:Publication . ?x ub:publicationAuthor &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; }; VARCHAR _______________________________________________________________________ http://www.Department0.University0.edu/AssistantProfessor0/Publication0 http://www.Department0.University0.edu/AssistantProfessor0/Publication1 http://www.Department0.University0.edu/AssistantProfessor0/Publication2 http://www.Department0.University0.edu/AssistantProfessor0/Publication3 http://www.Department0.University0.edu/AssistantProfessor0/Publication4 http://www.Department0.University0.edu/AssistantProfessor0/Publication5 6 Rows. -- 4 msec. To stop the server, simply type shutdown; at the SQL&gt; prompt. If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like â [HTTPServer] ServerPort = 8421 ServerRoot = . ServerThreads = 2 â to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL&gt; prompt and then virtuoso-t -f &amp; at the shell prompt). Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL. 
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface. To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF. To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>It is a long standing promise of mine to dispel the false impression that using <a href="http://virtuoso.openlinksw.com/" id="link-id113506d0">Virtuoso</a> to work with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id115d9528">RDF</a> is complicated.</p>

<p>The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.</p>

<p>You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under emacs (<code>M-x shell</code>) is the best.  But the open source <code>isql</code> utility should have command line editing also.  The emacs shell is however convenient for cutting and pasting things between shell and files.</p>

<p>To get started, cd into <code>binsrc/tests/lubm</code>.</p>

<p>To verify that this works, you can do </p>

<blockquote>
<pre>./test_server.sh virtuoso-t</pre></blockquote>

<p>This will test the server with the LUBM queries.  This should report 45 tests passed.  After this we will do the tests step-by-step.</p>

<h2>Loading the <a href="http://dbpedia.org/resource/Data" id="link-id10f7bd90">Data</a>
</h2> 

<p>The file <code>lubm-load.sql</code> contains the commands for loading the LUBM single university qualification database.</p>

<p>The data files themselves are in <code>lubm_8000</code>, 15 files in RDFXML.</p>

<p>There is also a little ontology called <code>inf.nt</code>.  This declares the subclass and subproperty relations used in the benchmark.</p>

<p>So now let&#39;s go through this procedure.</p>

<p>Start the server:</p>

<blockquote>
<pre>$ virtuoso-t -f &amp;
</pre></blockquote>

<p>The <code>-f</code> option starts the server in foreground mode, so it does not detach as a daemon, and the trailing <code>&amp;</code> runs it as a background job of the shell.</p>

<p>Now we connect to it with the isql utility.</p>

<blockquote>
<pre>$ isql 1111 dba dba 
</pre></blockquote>

<p>This gives a <code>SQL&gt;</code> prompt.  The default username and password are both <code>dba</code>.</p>

<p>When a command is <a href="http://dbpedia.org/resource/SQL" id="link-id1176ce70">SQL</a>, it is entered directly.  If it is <a href="http://dbpedia.org/resource/SPARQL" id="link-id156df468">SPARQL</a>, it is prefixed with the keyword <code>sparql</code>.  This is how all the SQL clients work.  Any SQL client, such as any <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id152d0a00">ODBC</a> or <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id157ad6a0">JDBC</a> application, can use SPARQL if the SQL string starts with this keyword.</p>
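
<p>As a small illustration (the two statements are chosen here and are not part of the LUBM files), the first statement below is plain SQL against the quad table and the second is SPARQL introduced by the keyword:</p>

<blockquote>
<pre>SELECT COUNT(*) FROM DB.DBA.RDF_QUAD;

sparql 
   SELECT * 
    WHERE { ?s ?p ?o } 
    LIMIT 5;
</pre></blockquote>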

<p>The <code>lubm-load.sql</code> file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, <code>DB..RDF_LOAD_RDFXML</code>, for each file in a directory.</p>
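
<p>A procedure of this kind might look roughly as follows.  This is a sketch and not the text of <code>lubm-load.sql</code>: the loop, the use of <code>sys_dirlist</code> to list the files, and the base URI <code>&#39;lubm&#39;</code> are assumptions made for illustration, and the directory must be readable by the server.</p>

<blockquote>
<pre>create procedure load_lubm (in dir varchar)
{
  declare files any;
  declare i integer;
  -- second argument 1: list files rather than subdirectories
  files := sys_dirlist (dir, 1);
  for (i := 0; i &lt; length (files); i := i + 1)
    {
      -- arguments: RDF/XML text, base URI (assumed), target graph
      DB..RDF_LOAD_RDFXML (file_to_string (dir || aref (files, i)), &#39;lubm&#39;, &#39;lubm&#39;);
    }
};
</pre></blockquote>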

<p>Next it calls this function for the <code>lubm_8000</code> directory under the server&#39;s working directory.</p>

<blockquote>
<pre>sparql 
   CLEAR GRAPH &lt;lubm&gt;;

sparql 
   CLEAR GRAPH &lt;inf&gt;;

load_lubm ( server_root() || &#39;/lubm_8000/&#39; );
</pre></blockquote>

<p>Then it verifies that the right number of triples is found in the &lt;lubm&gt; graph.</p>

<blockquote>
<pre>sparql 
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.</p>

<p>Then it adds some implied <code>subOrganizationOf</code> triples.  This is part of setting up the LUBM test database.</p>

<blockquote>
<pre>sparql 
   PREFIX  ub:  &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   INSERT 
      INTO GRAPH &lt;lubm&gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &lt;lubm&gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
</pre></blockquote>

<p>Then it loads the ontology file, <code>inf.nt</code>, using the Turtle load function, <code>DB.DBA.TTLP</code>.  The arguments of the function are the text to load, the base URI against which relative URIs are resolved, and the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id15835550">URI</a> of the target graph.</p>

<blockquote>
<pre>DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), 
              &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, 
              &#39;inf&#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &lt;inf&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>Then we declare that the triples in the <code>&lt;inf&gt;</code> graph can be used for inference at run time.  To enable this, a SPARQL query will declare that it uses the <code>&#39;inft&#39;</code> rule set.  Otherwise this has no effect.</p>

<blockquote>
<pre>rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;);
</pre></blockquote>

<p>Finally, the file makes a log checkpoint to finalize the work and truncate the transaction log.  The server would also eventually do this in its own time.</p>

<blockquote>
<pre>checkpoint;
</pre></blockquote>

<p>Now we are ready for querying.</p>

<h2>Querying the Data</h2> 

<p>The queries are given in 3 different versions: The first file, <code>lubm.sql</code>, has the queries with most inference open coded as <code>UNIONs</code>. The second file, <code>lubm-inf.sql</code>, has the inference performed at run time using the ontology <a href="http://dbpedia.org/resource/Information" id="link-id1109faf0">information</a> in the <code>&lt;inf&gt;</code> graph we just loaded.  The last, <code>lubm-phys.sql</code>, relies on having the entailed triples physically present in the <code>&lt;lubm&gt;</code> graph.  These entailed triples are inserted by the SPARUL commands in the <code>lubm-cp.sql</code> file.</p>

<p>If you wish to run all the commands in a SQL file, you can type <code>load &lt;filename&gt;;</code> (e.g., <code>load lubm-cp.sql;</code>) at the <code>SQL&gt;</code> prompt. If you wish to try individual statements, you can paste them to the command line.</p>

<p>For example: </p>

<blockquote>
<pre>SQL&gt; sparql 
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT * 
     FROM &lt;lubm&gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
</pre></blockquote>
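
<p>The run-time inference variant of a query differs mainly in naming the rule set.  The following is a sketch rather than a quote from <code>lubm-inf.sql</code>; it assumes the <code>&#39;inft&#39;</code> rule set declared during the load and that <code>inf.nt</code> declares subclasses of <code>ub:Student</code>:</p>

<blockquote>
<pre>SQL&gt; sparql 
   define input:inference &#39;inft&#39;
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT COUNT(*) 
     FROM &lt;lubm&gt;
    WHERE { ?x  a  ub:Student };
</pre></blockquote>

<p>Without the <code>define</code> line, only resources typed directly as <code>ub:Student</code> would be counted.</p>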


<p>To stop the server, simply type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt.</p>

<p>If you wish to use a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id11384668">SPARQL protocol</a> end point, just enable the HTTP listener.  This is done by adding a stanza like the following to the end of the <code>virtuoso.ini</code> file in the <code>lubm</code> directory:</p>

<blockquote>
<pre>[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
</pre></blockquote>

<p>Then shut down and restart (type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt and then <code>virtuoso-t -f &amp;</code> at the shell prompt).</p>

<p>Now you can connect to the end point with a web browser.  The <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id113d02d8">URL</a> is <code>http://localhost:8421/sparql</code>. Without parameters, this will show a human readable form.  With parameters, this will execute SPARQL.</p>

<p>We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, <a href="http://dbpedia.org/resource/PHP" id="link-id142d0ba0">PHP</a>, using the PHP ODBC interface.</p>

<p>To see how to use <a href="http://jena.sourceforge.net/" id="link-id117074f0">Jena</a> or <a href="http://sourceforge.net/projects/sesame/" id="link-id1103c9b0">Sesame</a> with Virtuoso, look at <a href="http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html" id="link-id15488ce8">Native RDF Storage Providers</a>. To see how RDF data types are supported, see <a href="http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf" id="link-id15784a40">Extension datatype for RDF</a>.
</p>

<p>To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do <code>log_enable (2);</code> before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See <a href="http://docs.openlinksw.com/virtuoso/" id="link-id111410f0">documentation</a> for more.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484">
  <rss:title>Virtuoso Vs. MySQL:  Setting the Berlin Record Straight (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-20T11:06:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of the Berlin SPARQL Benchmark, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the BSBM paper had 3 seconds for MySQL and 100 for Virtuoso, basically through ignoring cache effects. So we decided to do it ourselves. The score is (updated with revised innodb_buffer_pool_size setting, based on advice noted down below): n-clients Virtuoso MySQL (with increased buffer pool size) MySQL (with default buffer pool size) 1 41,161.33 27,023.11 12,171.41 4 127,918.30 (pending) 37,566.82 8 218,162.29 105,524.23 51,104.39 16 214,763.58 98,852.42 47,589.18 The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here. The benchmark is pure SQL, nothing to do with SPARQL or RDF. The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux. The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran an analyze table on each MySQL table but noticed that this had no effect. Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests. The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine -- Virtuoso configuration file MySQL configuration file Table definitions &amp; RDF views Indexes on MySQL tables MySQL ought to do better. 
We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such. Update: Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache. On the first run, we noticed that the InnoDB warm-up time was somewhere well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run a larger-scale system for half an hour first, then measure, and then measure again. If the second measurement is the same as the first, then it is good. As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa5314d8">Berlin SPARQL Benchmark</a>, I have repeatedly written about measurement procedures and steady state.  The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup.  Thus it came to pass that one cut of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x18482c20">BSBM</a> paper had 3 seconds for <a href="http://dbpedia.org/resource/MySQL" id="link-id0xb8c54de8">MySQL</a> and 100 for <a href="http://virtuoso.openlinksw.com" id="link-id0x189b2210">Virtuoso</a>, basically through ignoring cache effects.</p>

<p>So we decided to do it ourselves.</p>

<p>The score is (updated with the revised <code>innodb_buffer_pool_size</code> setting, based on the advice noted below):</p>

<table border="1" cellspacing="2" cellpadding="5">
<tr>
    <th>n-clients</th>
    <th>Virtuoso</th>
    <th>MySQL <br /> (with increased buffer pool size)</th>
    <th>MySQL <br /> (with default buffer pool size)</th>
  </tr>
<tr align="right">
    <td>1</td>
    <td> 41,161.33</td>
    <td> 27,023.11 </td>
    <td> 12,171.41</td>
  </tr>
<tr align="right">
    <td>4</td>
    <td> 127,918.30</td>
    <td> (pending) </td>
    <td>  37,566.82</td>
  </tr>
<tr align="right">
    <td>8</td>
    <td> 218,162.29 </td>
    <td> 105,524.23 </td>
    <td>  51,104.39 </td>
  </tr>
<tr align="right">
    <td>16</td>
    <td> 214,763.58 </td>
    <td>  98,852.42 </td>
    <td>  47,589.18 </td>
  </tr>
</table>


<p>The metric is the query mixes per hour from the BSBM test driver output.  For the interested, the complete output is <a href="http://www.openlinksw.com/weblog/oerling/texts/bsbmres.txt" id="link-id1119f770">here</a>.</p>
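<p>To make the table easier to read, the following sketch computes throughput ratios from the figures above. The numbers are copied from the table; the dictionary keys and function names are labels invented for this example, and the pending 4-client run is kept as a missing value instead of being guessed.</p>

```python
# Query mixes per hour, copied from the table above.
# None marks the run that is still listed as pending.
RESULTS = {
    1: {"virtuoso": 41161.33, "mysql_tuned": 27023.11, "mysql_default": 12171.41},
    4: {"virtuoso": 127918.30, "mysql_tuned": None, "mysql_default": 37566.82},
    8: {"virtuoso": 218162.29, "mysql_tuned": 105524.23, "mysql_default": 51104.39},
    16: {"virtuoso": 214763.58, "mysql_tuned": 98852.42, "mysql_default": 47589.18},
}


def ratio(clients, numerator, denominator):
    """Throughput ratio for one client count, or None if a value is missing."""
    row = RESULTS[clients]
    if row[numerator] is None or row[denominator] is None:
        return None
    return row[numerator] / row[denominator]


def scaling(system, clients, baseline=1):
    """Throughput at `clients` relative to the single-client run of the same system."""
    return RESULTS[clients][system] / RESULTS[baseline][system]


if __name__ == "__main__":
    for n in sorted(RESULTS):
        r = ratio(n, "virtuoso", "mysql_tuned")
        print(n, "pending" if r is None else round(r, 2))
```

<p>With these figures, Virtuoso comes out at roughly 1.5 times the tuned MySQL throughput with one client and roughly 2 times with 8 or 16 clients.</p>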

<p>The benchmark is pure <a href="http://dbpedia.org/resource/SQL" id="link-id0x5257718">SQL</a>, nothing to do with <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb8c463e0">SPARQL</a> or <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x16e68d50">RDF</a>.</p>

<p>The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 GB RAM.  The OS is 64-bit Debian Linux.</p>

<p>The benchmark was run at a scale of 200,000.  Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like.  Both databases were configured to use 8 GB for disk cache.  The test effectively runs from memory.  We ran <code>ANALYZE TABLE</code> on each MySQL table but noticed that this had no effect.  Virtuoso does the stats sampling on the go; possibly MySQL does too, since the explicit stats did not make any difference.  The MySQL tables were served by the InnoDB engine.  MySQL appears to cache results of queries in some cases.  This was not apparent in the tests.</p>

<p>The versions are 5.09 for Virtuoso and 5.1.29 for MySQL.  You can download and examine:</p>
<ul> 
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/virtuoso.ini" id="link-id14fe17f0">Virtuoso configuration file</a>
</li>
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/my.cnf" id="link-id116fe490">MySQL configuration file</a>
</li>
<li>
    <a href="http://www.openlinksw.com/weblog/oerling/texts/create_tables_and_rdf_view.sql" id="link-id14ce9268">Table definitions &amp; RDF views</a> 
</li>
<li> <a href="http://www.openlinksw.com/weblog/oerling/texts/mysqlinx.sql" id="link-id1535e298">Indexes on MySQL tables</a>
</li>
</ul>

<p>
<strike>MySQL ought to do better.  We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL.  It is possible there is a config parameter that affects this.  The public is invited to tell us about such.</strike>
</p>

<p>
<b>Update:</b>
</p>

<p>Andreas Schultz of the BSBM team advised us to increase the <code>innodb_buffer_pool_size</code> setting in the MySQL config.  We did and it produced some improvement.  This is more like it: we now see CPU utilization around 700%, instead of the 300% that made the previously published run suspect. Also, our experiments with TPC-D led us to expect better.  We ran these things a few times so as to have warm cache.</p>

<p>On the first run, we noticed that the InnoDB warm-up time was somewhere well in excess of 2000 query mixes.  Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso.  We recently made a greedy prefetch hack that should give us some mileage there.  For the next BSBM, all we can advise is to run a larger-scale system for half an hour first, then measure, and then measure again.  If the second measurement is the same as the first, then it is good.</p>
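<p>The measure-twice rule can be sketched as a small check. This is only an illustration: the function names are made up here, and the 5% tolerance for treating two measurements as "the same" is an assumption of this sketch, not something the benchmark prescribes.</p>

```python
def relative_change(first, second):
    """Relative difference between two throughput measurements."""
    return abs(second - first) / first


def is_steady(first, second, tolerance=0.05):
    """True when the second measurement is within `tolerance` of the first.

    The 5% default is an assumption made for this sketch, not a BSBM rule.
    """
    return tolerance >= relative_change(first, second)


def measure_until_steady(measure, max_rounds=10, tolerance=0.05):
    """Call `measure` repeatedly until two consecutive results agree.

    `measure` is any zero-argument callable returning query mixes per hour.
    Returns the last measurement, or None if no two consecutive runs agreed.
    """
    previous = measure()
    for _ in range(max_rounds - 1):
        current = measure()
        if is_steady(previous, current, tolerance):
            return current
        previous = current
    return None
```

<p>In practice <code>measure</code> would wrap one run of the test driver; any figure produced before two consecutive runs agree would be treated as warm-up and discarded.</p>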

<p>As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x122eaa00">TPC-H</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479">
  <rss:title>ISWC 2008: Some Questions</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T15:54:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Inference: Is it always forward chaining? We got a number of questions about Virtuoso&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment. Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine. For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed. We had a word with Ian Horrocks on the question. He noted that it is often naive on the part of the community to tend to equate description of semantics with description of algorithm. The data need not always be blown up. The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. 
This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if the entailment of subclasses and sub-properties is materialized rather than done at run time. Genetic Algorithms for SPARQL? Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al, for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us. Schema Last? It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge. There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware. It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Inference: Is it always forward chaining?</h2>

<p>We got a number of questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x131604a8">Virtuoso</a>&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.</p>

<p>Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1422f910">SPARQL</a>, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x145894a8">owl</a>:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.</p>

<p>For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our <a href="http://dbpedia.org/resource/SQL" id="link-id0x1458a288">SQL</a>/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.</p>

<p>We had a word with <a href="http://web.comlab.ox.ac.uk/people/Ian.Horrocks/" id="link-id117c99d0">Ian Horrocks</a> on the question. He noted that it is often naive on the part of the community to tend to equate description of semantics with description of algorithm. The <a href="http://dbpedia.org/resource/Data" id="link-id0x14cf0b18">data</a> need not always be blown up.</p>

<p>The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if the entailment of subclasses and sub-properties is materialized rather than done at run time.</p>
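<p>The trade-off can be illustrated with a toy subclass hierarchy. The sketch below is not how Virtuoso implements inference; the class names, instances, and functions are all hypothetical, and the hierarchy is assumed to be free of cycles. It contrasts resolving subclass entailment while answering a query with storing every entailed type up front.</p>

```python
# Hypothetical toy data: class -> direct superclass, and asserted instance types.
SUBCLASS_OF = {
    "AssistantProfessor": "Professor",
    "Professor": "Faculty",
    "Faculty": "Employee",
}
ASSERTED_TYPES = {("alice", "AssistantProfessor"), ("bob", "Faculty")}


def subclasses(target):
    """Return target plus every class below it (goal-directed expansion)."""
    found = {target}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def instances_at_run_time(target):
    """Backward chaining: expand the goal class at query time, store nothing extra."""
    wanted = subclasses(target)
    return {inst for inst, cls in ASSERTED_TYPES if cls in wanted}


def superclasses(cls):
    """Yield cls and every class above it by following SUBCLASS_OF links."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def materialize():
    """Forward chaining: store every entailed (instance, class) pair up front."""
    return {(inst, sup) for inst, cls in ASSERTED_TYPES for sup in superclasses(cls)}


def instances_materialized(target, closure):
    """Plain lookup against the materialized pairs."""
    return {inst for inst, cls in closure if cls == target}
```

<p>Both routes give the same answer, but here the materialized set holds six pairs against two asserted ones, and retracting a single asserted type would mean finding and removing everything derived from it.</p>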

<h2>Genetic Algorithms for SPARQL?</h2>

<p>Our compliments for the wildest idea of the conference go to <a href="http://www.eyaloren.org/" id="link-id1a203af8">Eyal Oren</a>, <a href="http://www.few.vu.nl/~cgueret/" id="link-id16208758">Christophe Guéret</a>, and <a href="http://www.few.vu.nl/~schlobac/" id="link-id111923e0">Stefan Schlobach</a>, <i>et al</i>, for their <a href="http://www.informatik.uni-trier.de/~ley/db/conf/semweb/iswc2008.html#OrenGS08" id="link-id11793540">paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated</a>. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us.</p>

<h2>Schema Last?</h2>

<p>It is interesting to see how the industry comes to the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1154c1b0">semantic web</a> conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge.</p>

<p>There is a point to schema last. <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x14c6a930">RDF</a> is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware.</p>

<p>It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-03#1471">
  <rss:title>ISWC 2008: The Scalable Knowledge Systems Workshop</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-03T13:16:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Mike Dean of BBN Technologies opened the Scalable Knowledge Systems Workshop with an invited talk. He reminded us of the facts of nature as concern the cost of distributed computing and running out of space for the working set. Developers in the semantic web field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can&#39;t join across partitions. I gave a talk about the Virtuoso Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible. Further, Mike Dean talked about ASIO, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files. Will the CODASYL days come back, and will the linked list on disk be the way to store triples/quads? I would say that this will have, especially with a memory-mapped file, probably a better best case than a B-tree but that this also will be less predictable with fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation. Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages. So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation. 
Chris Bizer presented the Berlin SPARQL Benchmark (BSBM), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases. Regardless of steady state, BSBM has two generally valid conclusions: mapping relational to RDF, where possible, is faster than triple storage; and the equivalent relational solution can be some 10x faster than the pure triples representation. Mike Dean asked whether BSBM was a case of a setup to have triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged; take the Billion Triples Challenge data set for one. Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies against native SQL since the difference there is compiling the query, especially since parameters are not used. So, Chris, since we both like to map, let&#39;s make a benchmark that shows mapping closer to native SQL. Bridging the 10x Gap? When we run Virtuoso relational against Virtuoso triple store with the TPC-H workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column. 
Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower. This is just a matter of the count of executed instructions. A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, and the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store&#39;s row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store. There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. For the ideal row store workload, the RDBMS will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables. Scale, Where Next? Concerning the future direction of the workshop, there were a few directions suggested. 
One of the more interesting ones was Mike Dean&#39;s suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Mike Dean of <a href="http://dbpedia.org/resource/BBN_Technologies" id="link-id0x21d04768">BBN Technologies</a> opened the Scalable <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x22348c58">Knowledge</a> Systems Workshop with an invited talk.  He reminded us of the facts of nature as concern the cost of distributed computing and running out of space for the working set. Developers in the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x22570328">semantic web</a> field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can&#39;t join across partitions.</p>

<p>I gave a talk about the <a href="http://virtuoso.openlinksw.com" id="link-id0x23f313f0">Virtuoso</a> Cluster edition, wherein I repeated essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing.  To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long running things going at a time and blocking as seldom as possible.</p>

<p>Further, Mike Dean talked about <a href="http://www.asio.bbn.com/" id="link-id0x1d74c108">ASIO</a>, the BBN suite of semantic web tools.  His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files. </p>

<p>Will the <a href="http://dbpedia.org/resource/CODASYL" id="link-id0x1f8ee860">CODASYL</a> days come back, and will the linked list on disk be the way to store triples/quads?  I would say that this will have, especially with a memory-mapped file, probably a better best case than a B-tree but that this also will be less predictable with fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries.  With a disk-based memory-mapped linked-list storage, we would see some improvements in this while getting hit probably worse than now in the case of fragmentation.  Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages.  So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation.</p>

<p>Chris Bizer presented the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1d670da0">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x21928808">BSBM</a>), which has already been discussed here in some detail.  He did acknowledge that the next round of the race must have a real steady-state rule.  This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. Reaching steady state can take 20-30 minutes in some cases.</p>

<p>Regardless of steady state, BSBM has two generally valid conclusions:
</p>
<ol>
<li>mapping relational to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xab811020">RDF</a>, where possible, is faster than triple storage; and </li>
<li>the equivalent relational solution can be some 10x faster than the pure triples representation.</li>
</ol>

<p>Mike Dean asked whether BSBM was a case of a setup to have triple stores fail.  Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies.  Therefore it must have a workload where mapping makes sense.  Of course there are workloads where triples are unchallenged; take the <a href="http://challenge.semanticweb.org/" id="link-id0x2538c3b8">Billion Triples Challenge</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x1d673760">data</a> set for one.</p>

<p>Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data.  Also, even if the scale is large, the working set is not nearly the size of the database.  This in fact penalizes mapping technologies against native <a href="http://dbpedia.org/resource/SQL" id="link-id0xac16cc10">SQL</a> since the difference there is compiling the query, especially since parameters are not used.  So, Chris, since we both like to map, let&#39;s make a benchmark that shows mapping closer to native SQL.</p>


<h2>Bridging the 10x Gap?</h2>

<p>When we run Virtuoso relational against Virtuoso triple store with the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1d7dc518">TPC-H</a> workload, we see that the relational case is significantly faster.  These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times.  Why is this?  The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column.  Also, since the number of total joins is lower, the overhead coming from moving from join to next join is likewise lower.  This is just a matter of the count of executed instructions.</p>
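<p>The difference in lookup counts can be shown with a toy model. This is not Virtuoso code: the record, the column names, and the <code>CountingIndex</code> class are invented for the example, and a dictionary lookup merely stands in for an index probe.</p>

```python
class CountingIndex:
    """A dict wrapper that counts lookups, standing in for an index probe."""

    def __init__(self, data):
        self.data = dict(data)
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self.data[key]


# Hypothetical order record stored two ways.
COLUMNS = ("customer", "status", "total")

row_index = CountingIndex({42: ("acme", "shipped", 99.5)})
triple_index = CountingIndex({
    (42, "customer"): "acme",
    (42, "status"): "shipped",
    (42, "total"): 99.5,
})


def fetch_row(key):
    """One probe returns every column of the row."""
    return dict(zip(COLUMNS, row_index.get(key)))


def fetch_triples(key):
    """One probe per property: each column value is a separate join step."""
    return {column: triple_index.get((key, column)) for column in COLUMNS}
```

<p>Fetching the same record costs one probe in the row layout and three in the triple layout, which is the per-column overhead described above in miniature.</p>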

<p>A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local, and the needed lookup can often use the previous location as a starting point.  A triple store can do the same if queries have high locality.  We do this in some SQL situations and can try this with triples also.  The RDF workload is typically more random in its access pattern, though.  The other factor is the length of control path.  A column store has a simpler control flow if it knows that the column will have exactly one value per row.  With RDF, this is not a given. Also, the column store&#39;s row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.</p>

<p>There was some discussion on how much closer a triple store could come to a relational one.  Some gains are undoubtedly possible.  We will see.  For the ideal row store workload, the <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x22e5b6f8">RDBMS</a> will continue to have some edge.  Large online systems typically have a large part of the workload that is simple and repetitive.  There is nothing to prevent one having special indices for supporting such workload, even while retaining the possibility of arbitrary triples elsewhere.  Some degree of application-specific data structure does make sense.  We just need to show how this is done.  In this way, we have a continuum and not an either/or choice of triples vs. tables.</p>
 
<h2>Scale, Where Next?</h2>

<p>Concerning the future direction of the workshop, there were a few directions suggested.  One of the more interesting ones was Mike Dean&#39;s suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical.  Of course, there is the question of scale.  This time, we were the only ones focusing on a parallel database with no restrictions on joining.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465">
  <rss:title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:15:35Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect? I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times. I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer). &quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said Sören Auer. Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept. This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store. We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there. The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing. 
There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish). Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering. The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted. This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story -- for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam? The world out there is complex. 
But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform? &quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot; We should answer in multiple parts. For general collateral, like web sites and documentation: The web site gives a confused product image. For the Virtuoso product, we should divide at the top into Data web and RDF - Host linked data, expose relational assets as linked data; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration; Web Services - access all the above over standard protocols, dynamic web pages, web hosting. For each point, one simple statement. We all know what the above things mean? Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this much more requests per second. This too is clear. Far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this. Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. 
About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either. Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves. I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to. &quot;Virtuoso is very big and very difficult&quot; One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files. This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required. Now for the difficult side: Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to -- PHP or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI. 
Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers is cluttered. We should put this into a couple of use case oriented how-tos. Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there -- please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly. &quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot; Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed. If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply. We are talking about such things with different parties at present. &quot;How webby are you? 
What is webby?&quot; &quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot; We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice -- the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open. Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud. &quot;How does OpenLink see the next five years unfolding?&quot; Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. 
Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL. The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood. Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point. Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these. With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertizing, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy. For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. 
The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have. This makes some things possible that were hard thus far. On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth. Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community. Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more. Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1fa3da18">semantic web</a> ought to be, but is our own perfect?</p>

<p>I will here engage in some critical introspection as well as amplify on some answers given to <a href="http://virtuoso.openlinksw.com" id="link-id0x1e1eecf0">Virtuoso</a>-related questions in recent times.</p>

<p>I use some conversations from the <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1ec0b2e0">Vienna</a> <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2045ac10">Linked Data</a> Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x2045ac38">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x14f63c58">OpenLink Data Spaces</a>) applications line, <a href="http://oat.openlinksw.com/" id="link-id0x14f63c80">OAT</a> (<a href="http://oat.openlinksw.com/" id="link-id0x1e536928">OpenLink Ajax Toolkit</a>), or <a href="http://ode.openlinksw.com/" id="link-id0x1eaed7f8">ODE</a> (<a href="http://ode.openlinksw.com/" id="link-id0x1edfff88">OpenLink Data Explorer</a>).</p>

<h3>&quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x1b8a9580">Sören Auer</a>.</h3>

<p>Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.</p>

<p>This is why we put a lot of emphasis on Linked Data and the <a href="http://dbpedia.org/resource/Data" id="link-id0x1b85fa38">Data</a> Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f5b4468">RDF</a> store.</p>

<p>We can do this because we own our database and <a href="http://dbpedia.org/resource/SQL" id="link-id0x20076468">SQL</a> and data access middleware and have a history of connecting to any <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1ffd6f98">RDBMS</a> out there.</p>

<p>The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.</p>

<p>There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1316f720">SPARQL</a> options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced <a href="http://sourceforge.net/projects/sesame/" id="link-id0x20756a68">Sesame</a>- and <a href="http://jena.sourceforge.net/" id="link-id0x1ec01ac0">Jena</a>-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).</p>

<p>Now, this message could be better reflected in our material on the web. This <a href="http://dbpedia.org/resource/Blog" id="link-id0x2027b410">blog</a> is a rather informal step in this direction; more is to come. For now we concentrate on delivering.</p>

<p>The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.</p>

<p>This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story -- for example, turning the whole world into a database. This is why our examples start with things like turning the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x2051ff98">TPC-H</a> database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?</p>

<p>The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?</p>

<h3>&quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot;</h3>

<p>We should answer in multiple parts.</p>

<p>For general collateral, like web sites and documentation:</p>

<p>The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into</p>

<ul>  
<li> Data web and RDF - Host linked data, expose relational assets as linked data;</li>
<li> Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;</li>
<li> Web Services - access all the above over standard protocols, dynamic web pages, web hosting.</li>
</ul>

<p>For each point, one simple statement.  We all know what the above things mean?</p>

<p>Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means that much more data or, in some cases, that many more requests per second. This too is clear.</p>

<p>As far as I am concerned, hosting Java or .<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x1f297540">NET</a> does not have to be on the front page. Also, we have no great interest in going against <a href="http://dbpedia.org/resource/Apache" id="link-id0x1ea29578">Apache</a> when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.</p>

<p>Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.</p>

<p>Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.</p>

<p>I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about the query cost model alone: the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to the cost model, as these are important to the whole, but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.</p>

<h3>&quot;Virtuoso is very big and very difficult&quot;</h3>

<p>One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.</p>

<p>This gives you SQL and SPARQL out of the box.  Adding <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x20a2e7d0">ODBC</a> and <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1e4cceb8">JDBC</a> clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation is required.</p>

<p>Now for the difficult side:</p>

<p>Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to -- <a href="http://dbpedia.org/resource/PHP" id="link-id0x20b03f08">PHP</a> or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI.</p>

<p>Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, as produced by the installers, is cluttered.</p>

<p>We should put this into a couple of use case oriented how-tos.</p>

<p>Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there -- please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.</p>

<h3>&quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot;</h3>

<p>Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.</p>

<p>If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.</p>

<p>We are talking about such things with different parties at present.</p>

<h3>&quot;How webby are you?  What is webby?&quot;</h3>

<p>&quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot;</p>

<p>We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice -- the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant <i>when</i> it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.</p>

<p>Google-style crawling of everything becomes less practical if one needs to run complex <i>ad hoc</i> queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.</p>

<h3>&quot;How does OpenLink see the next five years unfolding?&quot;</h3>

<p>Personally, I think we have the basics for the birth of a new inflection in the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x2018bd98">knowledge</a> economy. The <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x1ec110d8">URI</a> is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more <a href="http://dbpedia.org/resource/Information" id="link-id0x1ecfba08">information</a> can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.</p>

<p>The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is <i>context</i>. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.</p>

<p>Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies, dereferenceability, and discoverability are important. Without these, we are at best dealing with closed systems, however smart they may be. The expert systems of the 1980s are a case in point.</p>

<p>Ever since the .com era, the <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id0x1c4c9248">URL</a> has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.</p>

<p>With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.</p>

<p>For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x20791bf0">Linked Open Data</a> cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.</p>

<p>This makes some things possible that were hard thus far.</p>

<p>On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.</p>

<p>Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1f27e6d8">Sponger</a>, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.</p>

<p>Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.</p>

<p>Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event.  Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448">
  <rss:title>Virtuoso Update, Billion Triples and Outlook</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T09:31:17Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will say a few things about what we have been doing and where we can go. Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post. There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe. Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, is all there and proven to work. We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. We did along the way also run BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL. RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol. In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web. I will now talk about the complex warehouse/web-harvesting side. 
I will come to the mapping in another post. Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time. Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later. The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility. We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. 
Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case. We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface. We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person. Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible. The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity. Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. 
A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable. Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience. The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports. Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will say a few things about what we have been doing and where we can go.</p>

<p>Firstly, we have a fairly scalable platform with <a href="http://virtuoso.openlinksw.com" id="link-id0xa412e450">Virtuoso</a> 6 Cluster. It was most recently tested with the workload discussed in the previous <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id1638a5b8">Billion Triples post</a>.</p>

<p>There is an updated version of <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16280a68">the paper about this</a>. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.</p>

<p>Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more <a href="http://dbpedia.org/resource/SQL" id="link-id0x1c1c5f48">SQL</a> optimizations specific to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1bcb6108">RDF</a>. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.</p>

<p>We spent a lot of time around the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x3a4e17c8">Berlin SPARQL Benchmark</a> story, so we got to the more advanced stuff like the <a href="http://challenge.semanticweb.org/" id="link-id0x1a66c568">Billion Triples Challenge</a> rather late. We did along the way also run <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x188c2608">BSBM</a> with an <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1aa97f98">Oracle</a> back-end, with Virtuoso mapping <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1abd87a0">SPARQL</a> to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.</p>

<p>RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1a941678">RDB2RDF XG</a>. Examples of complex warehouses include the <a href="http://neurocommons.org/page/Main_Page" id="link-id0x1aa5a9f8">Neurocommons</a> database, the Billion Triples Challenge, and the <a href="http://www.garlik.com/" id="link-id0x372df7b0">Garlik DataPatrol</a>.</p>

<p>In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1a671cf0">Linked Data</a> forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1ab83dd0">data</a> web becomes as indispensable as presence on the HTML web.</p>

<p>I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.</p>

<p>Now, all the things shown in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id14de1d18">Billion Triples post</a> can be done with a relational system specially built for each purpose. Since we are a general purpose <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x340d3470">RDBMS</a>, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.</p>

<p>Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, a relational table plus mapping is the way to go. With Virtuoso, such tables can coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.</p>

<p>The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.</p>

<p>We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of <a href="http://umbel.org/about/" id="link-id0x19cabf38">UMBEL</a> and <a href="http://dbpedia.org/resource/Cyc" id="link-id0x19cecd10">OpenCyc</a>. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.</p>

<p>We expect to be able to combine geography, social proximity, subject matter, and <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0x1a8202e8">named entities</a>, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.</p>

<p>We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.</p>
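
<p>As a rough sketch of the first example, the query could have roughly the following shape. The SIOC and FOAF modeling, the news group IRI, and the search word are all assumptions made for illustration, as is attaching <code>foaf:topic_interest</code> directly to the author node; <code>bif:contains</code> is the Virtuoso-specific full text predicate, and the aggregate syntax presumes the SPARQL aggregate extensions.</p>

<blockquote>
<code>
PREFIX sioc: &lt;http://rdfs.org/sioc/ns#&gt;<br />
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;<br />
SELECT ?interest (COUNT(DISTINCT ?author) AS ?authors)<br />
WHERE {<br />
?post sioc:has_container &lt;http://example.org/newsgroup/example&gt; .<br />
?post sioc:content ?text .<br />
?post sioc:has_creator ?author .<br />
?text bif:contains &quot;semantic&quot; .<br />
?author foaf:topic_interest ?interest .<br />
}<br />
GROUP BY ?interest<br />
ORDER BY DESC(?authors)<br />
LIMIT 50
</code>
</blockquote>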

<p>Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.</p>

<p>The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of <a href="http://dbpedia.org/resource/Information" id="link-id0x1ab0a918">information</a> and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.</p>

<p>Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like <i>list the top 10 attributes with the most distinct values for all persons</i> cannot be written as a single generic SQL query, because SQL does not allow column names to be variables. In RDF, the property is just another variable in the pattern.</p>
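
<p>A minimal sketch of the query quoted above, assuming SPARQL with aggregate extensions and assuming that persons are typed as <code>foaf:Person</code> (the class is an illustrative choice, not something fixed by the data sets discussed here):</p>

<blockquote>
<code>
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;<br />
SELECT ?p (COUNT(DISTINCT ?o) AS ?distinct_values)<br />
WHERE {<br />
?s a foaf:Person .<br />
?s ?p ?o .<br />
}<br />
GROUP BY ?p<br />
ORDER BY DESC(?distinct_values)<br />
LIMIT 10
</code>
</blockquote>

<p>The point of the sketch is that <code>?p</code> ranges over whatever properties the data happens to contain, so no schema needs to be known ahead of time.</p>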

<p>Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.</p>

<p>The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.</p>

<p>Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-08-27#1422">
  <rss:title>A quick look at SP2B, the SPARQL Performance Benchmark</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-27T16:00:07Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I finally got around to running the SP2B SPARQL Performance Benchmark on the current Virtuoso Open Source Edition, v5.0.8. I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers. I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut. I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound. The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit FROM clause added; the client was the command line Interactive SQL (iSQL). If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the SPARQL protocol is not practical. I will say something more about SP2B when I get to have a closer look.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I finally got around to running the <a href="http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B" id="link-id17bac628">SP<sup>2</sup>B SPARQL Performance Benchmark</a> on the current <a href="http://virtuoso.openlinksw.com" id="link-id0x1d2a6838">Virtuoso</a> Open Source Edition, v5.0.8.</p>
<p>I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.</p>
<p>I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds.  This is better than the 800 or so seconds that the authors had measured.  Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.</p>
<p>I also tried it with a scale of 25M, but this became I/O bound and took a bit longer.  I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.</p>
<p>The machine was a 2GHz Xeon with 8G RAM.  The query text was the one from the authors, with an explicit <code>FROM</code> clause added; the client was the command line Interactive <a href="http://dbpedia.org/resource/SQL" id="link-id0x19e74ce0">SQL</a> (iSQL).</p>
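<p>For illustration only, adding an explicit <code>FROM</code> clause to a query has the following shape. The graph IRI and the triple pattern are hypothetical placeholders, not the benchmark&#39;s actual query text.</p>
<blockquote>
<code>
SELECT ?s ?p ?o<br />
FROM &lt;http://example.org/sp2b&gt;<br />
WHERE { ?s ?p ?o }<br />
LIMIT 10
</code>
</blockquote>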
<p>If one does the test with the default index layout without specifying a graph, things will not work very well.  Also, returning the million-row results of these queries over the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1c4231a0">SPARQL protocol</a> is not practical.</p>
<p>I will say something more about SP<sup>2</sup>B when I get to have a closer look.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-08-25#1418">
  <rss:title>Configuring Virtuoso for Benchmarking</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-25T14:05:46Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will here summarize what should be known about running benchmarks with Virtuoso. Physical Memory For 8G RAM, in the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... NumberOfBuffers = 550000 For 16G RAM, double this: [Parameters] ... NumberOfBuffers = 1100000 Transaction Isolation For most cases, certainly all RDF cases, Read Committed should be the default transaction isolation. In the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... DefaultIsolation = 2 Multiuser Workload If ODBC, JDBC, or similarly connected client applications are used, there must be more ServerThreads available than there will be client connections. In the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... ServerThreads = 100 With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer ServerThreads than there are concurrent clients. The MaxKeepAlives should be the maximum number of expected web clients. This can be more than the ServerThreads count. In the [HTTPServer] stanza of virtuoso.ini, set: [HTTPServer] ... ServerThreads = 100 MaxKeepAlives = 1000 KeepAliveTimeout = 10 Note: The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, should not exceed the licensed thread count. Disk Use The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed). For the above described example, in the [Database] stanza of virtuoso.ini, set: [Database] ... 
Striping = 1 MaxCheckpointRemap = 2000000 and in the [Striping] stanza, on one line per SegmentName, set: [Striping] ... Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6 As can be seen here, each file gets a background IO thread (the = qxxx clause). It should be noted that all files on the same physical device should have the same qxxx value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue. SQL Optimization If queries have lots of joins but access little data, as with the Berlin SPARQL Benchmark, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. Thus, in the [Parameters] stanza of virtuoso.ini, set: [Parameters] ... StopCompilerWhenXOverRunTime = 1</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will here summarize what should be known about running benchmarks with <a href="http://virtuoso.openlinksw.com" id="link-id0xc53af18">Virtuoso</a>.</p>

<h2>Physical Memory</h2>

<p>For 8G RAM, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 550000
</code>
</blockquote> 
<p>For 16G RAM, double this:</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 1100000
</code>
</blockquote> 

<h2>Transaction Isolation</h2>
<p>For most cases, certainly all <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xc2f07a0">RDF</a> cases, <i>Read Committed</i> should be the default transaction isolation.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code>
[Parameters]<br />
...<br />
DefaultIsolation = 2 
</code>
</blockquote> 

<h2>Multiuser Workload</h2>

<p>If <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0xc1c7178">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xd16fb40">JDBC</a>, or similarly connected client applications are used, there must be more <code>ServerThreads</code> available than there will be client connections.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code> 
[Parameters]<br />
...<br />
ServerThreads = 100
</code>
</blockquote> 

<p>With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer <code>ServerThreads</code> than there are concurrent clients.  The <code>MaxKeepAlives</code> should be the maximum number of expected web clients.  This can be more than the <code>ServerThreads</code> count.  In the <code>[HTTPServer]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code> 
[HTTPServer]<br />
...<br />
ServerThreads    = 100 <br />
MaxKeepAlives    = 1000 <br />
KeepAliveTimeout = 10
</code>
</blockquote> 

<p>
<i><b>Note:</b> The <code>[HTTPServer] ServerThreads</code> are taken from the total pool made available by the <code>[Parameters] ServerThreads</code>.  Thus, the <code>[Parameters] ServerThreads</code> should always be at least as large as (and is best set greater than) the <code>[HTTPServer] ServerThreads</code>, and if using the closed-source Commercial Version, should not exceed the licensed thread count.</i>
</p> 

<h2>Disk Use</h2>

<p>The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID.  For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).  </p>

<p>For the example described above, in the <code>[Database]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code>
[Database]<br />
...<br />
Striping = 1<br />
MaxCheckpointRemap 	= 2000000 
</code>
</blockquote> 

<p>Then, in the <code>[Striping]</code> stanza, on one line per <code>SegmentName</code>, set:</p>
<blockquote>
<code>
[Striping]<br />
...<br />
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6</code>
</blockquote> 

<p>As can be seen here, each file gets a background IO thread (the <code>= q<i>xxx</i></code> clause).  It should be noted that all files on the same physical device should have the same <code>q<i>xxx</i></code> value.  This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.</p>

<h2>
<a href="http://dbpedia.org/resource/SQL" id="link-id0xc9fa298">SQL</a> Optimization</h2>

<p>If queries have lots of joins but access little <a href="http://dbpedia.org/resource/Data" id="link-id0xb4e0aa0">data</a>, as with the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xb2de990">Berlin SPARQL Benchmark</a>, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far.  Thus, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set:</p>
<blockquote>
<code>
[Parameters]<br />
...<br />
StopCompilerWhenXOverRunTime = 1
</code>
</blockquote> 
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409">
  <rss:title>BSBM With Triples and Mapped Relational Data</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-06T19:35:27Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF. Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts. The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008. But why an OLTP workload with RDF to begin with? We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case. Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here. What We Got First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit. But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. 
And when there is a clear mapping, there is no reason the SPARQL could not be directly translated. If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor! We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in &quot;query mixes per hour&quot;. (An update or follow-on to this post will provide elapsed times for each test run.) With the unmodified benchmark we got: Physical Triples: 1297 qmph; Mapped Triples: 3144 qmph. In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.) The following were measured on the second run of a 100 query mix series, single test driver, warm cache. Physical Triples: 5746 qmph; Mapped Triples: 7525 qmph. We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time. Physical Triples: 19459 qmph; Mapped Triples: 24531 qmph. The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization. The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough. 
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so. Suggestions for BSBM Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC. Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload. Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let&#39;s use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot; For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC. Add transactions. Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server. Comments on Query Mix The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. 
Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales. Next We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The special contribution of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id10039db0">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id106b2538">BSBM</a>) to the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id101a75f8">RDF</a> world is to raise the question of doing OLTP with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xb230eb0">RDF</a>.</p>

<p>Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa832da8">BSBM</a> also specifies a relational schema and can generate the <a href="http://dbpedia.org/resource/Data" id="link-id1206c378">data</a> as either triples or <a href="http://dbpedia.org/resource/SQL" id="link-id1667f040">SQL</a> inserts.</p>

<p>The benchmark effectively simulates the case of exposing an existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id10a93518">RDBMS</a> as RDF.  <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id13e46d80">OpenLink Software</a> calls this <i>RDF Views</i>.  <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id12027578">Oracle</a> is beginning to call this <i>semantic covers</i>.  The <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id161dc678">RDB2RDF XG</a>, a W3C incubator group, has been active in this area since Spring, 2008.</p>

<h3>But why an OLTP workload with RDF to begin with?</h3>

<p>We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS.  If <a href="http://dbpedia.org/resource/Data" id="link-id0xabe48a0">data</a> is online for human consumption, it may be online via a <a href="http://dbpedia.org/resource/SPARQL" id="link-id106a8908">SPARQL</a> end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.</p>

<p>Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here.</p>

<h3>What We Got</h3>

<p>First, we found that <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400" id="link-id150ea748">making the query plan took much too long</a> in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.</p>

<p>But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is as good as unbeatable with the query mix.  And when there is a clear mapping, there is no reason the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x96bb5e0">SPARQL</a> could not be directly translated.</p>

<p>If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!</p>

<p>We filled two <a href="http://virtuoso.openlinksw.com" id="link-id12dbdc70">Virtuoso</a> instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &quot;query mixes per hour&quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)</p>

<p>With the unmodified benchmark we got:</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&#160;</td>
    <td>1297 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&#160;</td>
   <td><b>3144 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)</p>

<p>The following were measured on the second run of a 100 query mix series, single test driver, warm cache.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&#160;</td>
    <td> 5746 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&#160;</td>
   <td> <b>7525 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>We then ran the same with 4 concurrent instances of the test driver. The qmph here is the total of 400 query mixes (4 drivers, 100 mixes each) divided by the longest driver run time in hours.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>&#160;</td>
    <td> 19459 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>&#160;</td>
   <td> <b>24531 qmph</b>
   </td>
  </tr>
</table>
</blockquote>

<p>The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8 GB RAM.  The concurrent throughputs are a little under 4 times the single-driver throughput, which is normal for SMP due to memory contention.  The numbers show no significant overhead from thread synchronization.</p>
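<p>As a rough check of the scaling claim, the ratios can be recomputed from the qmph figures reported above. This is only an illustrative sketch; the dictionary layout and function names are our own and not part of the benchmark.</p>

```python
# qmph figures quoted in this post (runs with the text-index variant of Q6).
single_driver = {"physical": 5746, "mapped": 7525}
four_drivers = {"physical": 19459, "mapped": 24531}


def speedup(store):
    """Throughput with 4 concurrent drivers relative to a single driver."""
    return four_drivers[store] / single_driver[store]


def mapping_advantage(results):
    """How many times faster the mapped store is than physical triples."""
    return results["mapped"] / results["physical"]


for store in ("physical", "mapped"):
    print(store, "speedup with 4 drivers:", round(speedup(store), 2))
print("mapping advantage, 1 driver:", round(mapping_advantage(single_driver), 2))
print("mapping advantage, 4 drivers:", round(mapping_advantage(four_drivers), 2))
```

<p>Both stores scale by a similar factor, so the relative advantage of mapping stays roughly the same under concurrency.</p>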

<p>The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher.  We used the <code>StopCompilerWhenXOverRunTime = 1</code> option here to cut needless compiler overhead, the queries being straightforward enough.</p>

<p>We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.</p>

<h3>Suggestions for BSBM</h3>

<ul>
 <li>
  <p>
    <b>Reporting Rules.</b> The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.</p>
 </li>

<li>
  <p>
    <b>Multiuser operation.</b>  The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.</p>
</li>

<li>
  <p>
    <b>Add business intelligence.</b>  SPARQL has aggregates now, at least with <a href="http://jena.sourceforge.net/" id="link-id11a25ac0">Jena</a> and <a href="http://virtuoso.openlinksw.com" id="link-id0xa83f490">Virtuoso</a>, so let&#39;s use these.  The BSBM business intelligence metric should be a separate metric off the same data.  Adding synthetic sales figures would make more interesting queries possible.  For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot;</p>
</li>

<li>
  <p>
    <b>For the SPARQL community</b>, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id109e2448">SPARQL protocol</a> extension; the SPARUL syntax should also have a way of calling a procedure.  Something like <code>select proc (??, ??)</code> would be enough, where <code>??</code> is a parameter marker, like <code>?</code> in <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id13febf48">ODBC</a>/<a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id120416a8">JDBC</a>.</p>
</li>

<li>
  <p>
    <b>Add transactions.</b> Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store.  This could use stored procedures or logic in an app server.</p>
</li>
</ul>
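<p>To illustrate what a parameter marker buys in the SQL world, here is a minimal sketch using Python's built-in <code>sqlite3</code> module, which accepts <code>?</code> markers much like ODBC/JDBC. It is an analogy only: the table, data, and function name are made up for the example, and nothing here is SPARQL or Virtuoso syntax.</p>

```python
import sqlite3

# Hypothetical product table, loosely modelled on an online shop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO product (id, label) VALUES (?, ?)",
    [(1, "red widget"), (2, "blue gadget"), (3, "red gadget")],
)

# The statement text never changes; only the bound value does, so the
# engine has no reason to parse and plan a new query text per request.
QUERY = "SELECT id FROM product WHERE label LIKE ? ORDER BY id"


def find_products(word):
    """Return ids of products whose label contains the given word."""
    pattern = "%" + word + "%"
    return [row[0] for row in conn.execute(QUERY, (pattern,))]


print(find_products("red"))
print(find_products("gadget"))
```

<p>A parameterized SPARQL request would follow the same pattern: one fixed query text, with the values supplied separately on each call.</p>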

<h3>Comments on Query Mix</h3>

<p>The time of most queries is less than linear to the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.</p>
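<p>The difference can be sketched in a few lines. The scan below touches every label on each query, so its cost grows with the scale factor, while the word index only touches the entries for the words asked for. This is a toy illustration with invented names and data, not a description of how Virtuoso's text index works.</p>

```python
from collections import defaultdict


def scan_labels(labels, words):
    """Q6 without a text index: inspect every label on every query."""
    wanted = set(words)
    return sorted(
        pid for pid, label in labels.items()
        if wanted.intersection(label.lower().split())
    )


def build_text_index(labels):
    """Build a word-to-product-ids index once, at load time."""
    index = defaultdict(set)
    for pid, label in labels.items():
        for word in label.lower().split():
            index[word].add(pid)
    return index


def lookup(index, words):
    """Q6 with a text index: one dictionary probe per search word."""
    hits = set()
    for word in words:
        hits |= index.get(word, set())
    return sorted(hits)


# Hypothetical sample data.
labels = {1: "Red Widget", 2: "blue gadget", 3: "green widget"}
index = build_text_index(labels)
print(scan_labels(labels, ["red", "blue", "pink"]))
print(lookup(index, ["red", "blue", "pink"]))
```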

<h3>Next</h3>

<p>We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  This also includes all the query optimization work done for BSBM.  This will be available in the coming days.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-07-30#1400">
  <rss:title>Virtuoso Optimizations for the Berlin SPARQL Benchmark </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-07-30T18:17:54Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We had a look at Chris Bizer&#39;s initial results with the Berlin SPARQL Benchmark (BSBM) on Virtuoso. The first results were rather bad, as nearly all of the run time was spent optimizing the SPARQL statements and under 10% actually running them. So I spent a couple of days on the SPARQL/SQL compiler, to the effect of making it do a better guess of initial execution plan and streamlining some operations. In fact, many of the queries in BSBM are not particularly sensitive to execution plan, as they access a very small portion of the database. So to close the matter, I put in a flag that makes the SQL compiler give up on devising new plans if the time of the best plan so far is less than the time spent compiling so far. With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially. With the compiler time cut-off in place (ini parameter StopCompilerWhenXOverRunTime = 1), we get the following times, output from the BSBM test driver: Starting test... 
0: 1031.22 ms, total: 1151 ms 1: 982.89 ms, total: 1040 ms 2: 923.27 ms, total: 968 ms 3: 898.37 ms, total: 932 ms 4: 855.70 ms, total: 865 ms Scale factor: 10000 Number of query mix runs: 5 times min/max Query mix runtime: 0.8557 s / 1.0312 s Total runtime: 4.691 seconds QMpH: 3836.77 query mixes per hour CQET: 0.93829 seconds average runtime of query mix CQET (geom.): 0.93625 seconds geometric mean runtime of query mix Metrics for Query 1: Count: 5 times executed in whole run AQET: 0.012212 seconds (arithmetic mean) AQET(geom.): 0.009934 seconds (geometric mean) QPS: 81.89 Queries per second minQET/maxQET: 0.00684000s / 0.03115700s Average result count: 7.0 min/max result count: 3 / 10 Metrics for Query 2: Count: 35 times executed in whole run AQET: 0.030490 seconds (arithmetic mean) AQET(geom.): 0.029776 seconds (geometric mean) QPS: 32.80 Queries per second minQET/maxQET: 0.02467300s / 0.06753000s Average result count: 22.5 min/max result count: 15 / 30 Metrics for Query 3: Count: 5 times executed in whole run AQET: 0.006947 seconds (arithmetic mean) AQET(geom.): 0.006905 seconds (geometric mean) QPS: 143.95 Queries per second minQET/maxQET: 0.00580000s / 0.00795100s Average result count: 4.0 min/max result count: 0 / 10 Metrics for Query 4: Count: 5 times executed in whole run AQET: 0.008858 seconds (arithmetic mean) AQET(geom.): 0.008829 seconds (geometric mean) QPS: 112.89 Queries per second minQET/maxQET: 0.00804400s / 0.01019500s Average result count: 3.4 min/max result count: 0 / 10 Metrics for Query 5: Count: 5 times executed in whole run AQET: 0.087542 seconds (arithmetic mean) AQET(geom.): 0.087327 seconds (geometric mean) QPS: 11.42 Queries per second minQET/maxQET: 0.08165600s / 0.09889200s Average result count: 5.0 min/max result count: 5 / 5 Metrics for Query 6: Count: 5 times executed in whole run AQET: 0.131222 seconds (arithmetic mean) AQET(geom.): 0.131216 seconds (geometric mean) QPS: 7.62 Queries per second minQET/maxQET: 0.12924200s / 
0.13298200s Average result count: 3.6 min/max result count: 3 / 5 Metrics for Query 7: Count: 20 times executed in whole run AQET: 0.043601 seconds (arithmetic mean) AQET(geom.): 0.040890 seconds (geometric mean) QPS: 22.94 Queries per second minQET/maxQET: 0.01984400s / 0.06012600s Average result count: 26.4 min/max result count: 5 / 96 Metrics for Query 8: Count: 10 times executed in whole run AQET: 0.018168 seconds (arithmetic mean) AQET(geom.): 0.016205 seconds (geometric mean) QPS: 55.04 Queries per second minQET/maxQET: 0.01097600s / 0.05066900s Average result count: 12.8 min/max result count: 6 / 20 Metrics for Query 9: Count: 20 times executed in whole run AQET: 0.043813 seconds (arithmetic mean) AQET(geom.): 0.043807 seconds (geometric mean) QPS: 22.82 Queries per second minQET/maxQET: 0.04274900s / 0.04504100s Average result count: 0.0 min/max result count: 0 / 0 Metrics for Query 10: Count: 15 times executed in whole run AQET: 0.030697 seconds (arithmetic mean) AQET(geom.): 0.029651 seconds (geometric mean) QPS: 32.58 Queries per second minQET/maxQET: 0.02072000s / 0.03975700s Average result count: 1.1 min/max result count: 0 / 4 real 0 m 5.485 s user 0 m 2.233 s sys 0 m 0.170 s Of the approximately 5.5 seconds of running five query mixes, the test driver spends 2.2 s. The server side processing time is 3.1 s, of which SQL compilation is 1.35 s. The rest is miscellaneous system time. The measurement is on 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. We note that this type of workload would be done with stored procedures or prepared, parameterized queries in the SQL world. There will be some further tuning still but this addresses the bulk of the matter. There will be a separate message about the patch containing these improvements.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We had a look at Chris Bizer&#39;s initial results with the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id105c9f78">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id102d62b0">BSBM</a>) on <a href="http://virtuoso.openlinksw.com" id="link-id13eb9780">Virtuoso</a>.  The first results were rather bad, as nearly all of the run time was spent optimizing the <a href="http://dbpedia.org/resource/SPARQL" id="link-id14a51258">SPARQL</a> statements and under 10% actually running them.</p>
<p>So I spent a couple of days on the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xa5a8d0e8">SPARQL</a>/<a href="http://dbpedia.org/resource/SQL" id="link-id108745b0">SQL</a> compiler, making it produce a better initial guess at the execution plan and streamlining some operations.  In fact, many of the queries in <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xaf04af8">BSBM</a> are not particularly sensitive to the execution plan, as they access a very small portion of the database.  So to close the matter, I put in a flag that makes the <a href="http://dbpedia.org/resource/SQL" id="link-id0x1e8d2360">SQL</a> compiler give up on devising new plans once the estimated run time of the best plan so far is less than the time already spent compiling.</p>
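<p>The cut-off rule can be sketched as a simple loop. This is a minimal illustration of the idea, not the actual compiler code: the function and parameter names are invented, and treating the ini value as a multiplier <code>x</code> on the estimated run time is our assumption.</p>

```python
import time


def choose_plan(candidates, estimate_cost, x=1.0, clock=time.perf_counter):
    """Return the cheapest plan seen before the compile-time cut-off fires.

    candidates    -- iterable of candidate plans, in generation order
    estimate_cost -- function mapping a plan to its estimated run time (s)
    x             -- hypothetical multiplier standing in for the ini setting
    clock         -- time source, replaceable for testing
    """
    started = clock()
    best_plan, best_cost = None, float("inf")
    for plan in candidates:
        cost = estimate_cost(plan)
        if best_cost > cost:
            best_plan, best_cost = plan, cost
        compile_time = clock() - started
        # Stop once compiling has already taken longer than the best
        # plan found so far is expected to take to run.
        if compile_time > x * best_cost:
            break
    return best_plan
```

<p>For queries that touch little data, the best plan is cheap to run, so the loop ends after very few candidates; an expensive query still gets a thorough search.</p>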
<p>With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially.  With the compiler time cut-off in place (ini parameter <code>StopCompilerWhenXOverRunTime = 1</code>), we get the following times, output from the BSBM test driver:</p>
<blockquote>
<pre>
Starting test...

0: 1031.22 ms, total: 1151 ms
1:  982.89 ms, total: 1040 ms
2:  923.27 ms, total:  968 ms
3:  898.37 ms, total:  932 ms
4:  855.70 ms, total:  865 ms

Scale factor:               10000
Number of query mix runs:   5 times
min/max Query mix runtime:  0.8557 s / 1.0312 s
Total runtime:              4.691 seconds
QMpH:                       3836.77 query mixes per hour
CQET:                       0.93829 seconds average runtime 
                                       of query mix
CQET (geom.):               0.93625 seconds geometric mean 
                                       runtime of query mix

Metrics for Query 1:
   Count:                 5 times executed in whole run
   AQET:                  0.012212 seconds (arithmetic mean)
   AQET(geom.):           0.009934 seconds (geometric mean)
   QPS:                   81.89 Queries per second
   minQET/maxQET:         0.00684000s / 0.03115700s
   Average result count:  7.0
   min/max result count:  3 / 10

Metrics for Query 2:
   Count:                 35 times executed in whole run
   AQET:                  0.030490 seconds (arithmetic mean)
   AQET(geom.):           0.029776 seconds (geometric mean)
   QPS:                   32.80 Queries per second
   minQET/maxQET:         0.02467300s / 0.06753000s
   Average result count:  22.5
   min/max result count:  15 / 30

Metrics for Query 3:
   Count:                 5 times executed in whole run
   AQET:                  0.006947 seconds (arithmetic mean)
   AQET(geom.):           0.006905 seconds (geometric mean)
   QPS:                   143.95 Queries per second
   minQET/maxQET:         0.00580000s / 0.00795100s
   Average result count:  4.0
   min/max result count:  0 / 10

Metrics for Query 4:
   Count:                 5 times executed in whole run
   AQET:                  0.008858 seconds (arithmetic mean)
   AQET(geom.):           0.008829 seconds (geometric mean)
   QPS:                   112.89 Queries per second
   minQET/maxQET:         0.00804400s / 0.01019500s
   Average result count:  3.4
   min/max result count:  0 / 10

Metrics for Query 5:
   Count:                 5 times executed in whole run
   AQET:                  0.087542 seconds (arithmetic mean)
   AQET(geom.):           0.087327 seconds (geometric mean)
   QPS:                   11.42 Queries per second
   minQET/maxQET:         0.08165600s / 0.09889200s
   Average result count:  5.0
   min/max result count:  5 / 5

Metrics for Query 6:
   Count:                 5 times executed in whole run
   AQET:                  0.131222 seconds (arithmetic mean)
   AQET(geom.):           0.131216 seconds (geometric mean)
   QPS:                   7.62 Queries per second
   minQET/maxQET:         0.12924200s / 0.13298200s
   Average result count:  3.6
   min/max result count:  3 / 5

Metrics for Query 7:
   Count:                 20 times executed in whole run
   AQET:                  0.043601 seconds (arithmetic mean)
   AQET(geom.):           0.040890 seconds (geometric mean)
   QPS:                   22.94 Queries per second
   minQET/maxQET:         0.01984400s / 0.06012600s
   Average result count:  26.4
   min/max result count:  5 / 96

Metrics for Query 8:
   Count:                 10 times executed in whole run
   AQET:                  0.018168 seconds (arithmetic mean)
   AQET(geom.):           0.016205 seconds (geometric mean)
   QPS:                   55.04 Queries per second
   minQET/maxQET:         0.01097600s / 0.05066900s
   Average result count:  12.8
   min/max result count:  6 / 20

Metrics for Query 9:
   Count:                 20 times executed in whole run
   AQET:                  0.043813 seconds (arithmetic mean)
   AQET(geom.):           0.043807 seconds (geometric mean)
   QPS:                   22.82 Queries per second
   minQET/maxQET:         0.04274900s / 0.04504100s
   Average result count:  0.0
   min/max result count:  0 / 0

Metrics for Query 10:
   Count:                 15 times executed in whole run
   AQET:                  0.030697 seconds (arithmetic mean)
   AQET(geom.):           0.029651 seconds (geometric mean)
   QPS:                   32.58 Queries per second
   minQET/maxQET:         0.02072000s / 0.03975700s
   Average result count:  1.1
   min/max result count:  0 / 4

   real  0 m 5.485 s
   user  0 m 2.233 s
   sys   0 m 0.170 s
</pre></blockquote>
<p>Of the approximately 5.5 seconds of running five query mixes, the test driver spends 2.2 s.  The server side processing time is 3.1 s, of which SQL compilation is 1.35 s.  The rest is miscellaneous system time.  The measurement is on 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. </p>
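<p>The summary figures in the driver output follow directly from the five per-mix times. The short sketch below recomputes them as a sanity check; the variable names are ours, and the formulas are inferred from the reported numbers rather than taken from the driver's source.</p>

```python
import math

# Per-mix run times in milliseconds, copied from the driver output above.
mix_times_ms = [1031.22, 982.89, 923.27, 898.37, 855.70]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


cqet_s = arithmetic_mean(mix_times_ms) / 1000.0
cqet_geom_s = geometric_mean(mix_times_ms) / 1000.0
qmph = 3600.0 / cqet_s  # query mixes per hour

# Share of server-side time spent compiling, from the figures in the text.
compile_share = 1.35 / 3.1

print("CQET:          %.5f s" % cqet_s)
print("CQET (geom.):  %.5f s" % cqet_geom_s)
print("QMpH:          %.2f" % qmph)
print("compile share: %.0f%%" % (100 * compile_share))
```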
<p>We note that this type of workload would be done with stored procedures or prepared, parameterized queries in the SQL world.</p>
<p>There will be some further tuning still but this addresses the bulk of the matter.  There will be a separate message about the patch containing these improvements.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-05-09#1358">
  <rss:title>DBpedia Benchmark Revisited</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-05-09T19:27:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We ran the DBpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries. Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box. We ran the queries in two variants: With graph specified in the SPARQL FROM clause, using the default indices. With no graph specified anywhere, using an alternate indexing scheme. The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times. First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient. So we added the index: create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s); Virtuoso v5 with gspo, ogps, pogs: cold 210 s, warm 0.600 s. Virtuoso Cluster v6 with gspo, ogps: cold 136 s, warm 4.01 s. Virtuoso Cluster v6 with gspo, ogps, pogs: cold 33.4 s, warm 0.628 s. OK, so now let us do it without a graph being specified. 
For all platforms, we drop any existing indices, and -- create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)) alter index R2 on R2 partition (s int (0hexffff00)); log_enable (2); insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; drop table rdf_quad; alter table r2 rename RDF_QUAD; create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff)); The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient. With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia. If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme. Virtuoso v5 with gspo, ogps, pogs: warm 0.595 s. Virtuoso Cluster v6 with spog, pogs, opgs, gpos: warm 0.617 s. The cold times were about the same as above, so not reproduced. Graph or No Graph? It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default. On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. 
We have two distinct use cases which warrant different setups of the database, simple as that. The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above. Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some &quot;super-graph&quot; object, a graph of all graphs. I don&#39;t think this is directly supported and besides most apps would not need it. Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given. Conclusions The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat. Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content. This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long. 
The moral of the story is to have a lot of RAM and space-efficient data representation. The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic. Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative. Appendix: Query Text For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM &lt;http://dbpedia.org&gt; from each query. The returned row counts are indicated below each query&#39;s text. sparql SELECT ?p ?o FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/Metropolitan_Museum_of_Art&gt; ?p ?o }; -- 1337 rows sparql PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?film1 ?actor1 ?film2 ?actor2 FROM &lt;http://dbpedia.org&gt; WHERE { ?film1 p:starring &lt;http://dbpedia.org/resource/Kevin_Bacon&gt; . ?film1 p:starring ?actor1 . ?film2 p:starring ?actor1 . ?film2 p:starring ?actor2 . }; -- 23910 rows sparql PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?artist ?artwork ?museum ?director FROM &lt;http://dbpedia.org&gt; WHERE { ?artwork p:artist ?artist . ?artwork p:museum ?museum . ?museum p:director ?director }; -- 303 rows sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt; PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; SELECT ?s ?homepage FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/Berlin&gt; geo:lat ?berlinLat . &lt;http://dbpedia.org/resource/Berlin&gt; geo:long ?berlinLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s foaf:homepage ?homepage . 
FILTER ( ?lat &lt;= ?berlinLat + 0.03190235436 &amp;&amp; ?long &gt;= ?berlinLong - 0.08679199218 &amp;&amp; ?lat &gt;= ?berlinLat - 0.03190235436 &amp;&amp; ?long &lt;= ?berlinLong + 0.08679199218) }; -- 56 rows sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt; PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?s ?a ?homepage FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/New_York_City&gt; geo:lat ?nyLat . &lt;http://dbpedia.org/resource/New_York_City&gt; geo:long ?nyLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s p:architect ?a . ?a foaf:homepage ?homepage . FILTER ( ?lat &lt;= ?nyLat + 0.3190235436 &amp;&amp; ?long &gt;= ?nyLong - 0.8679199218 &amp;&amp; ?lat &gt;= ?nyLat - 0.3190235436 &amp;&amp; ?long &lt;= ?nyLong + 0.8679199218) }; -- 13 rows</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We ran the <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1b7f9688">DBpedia</a> benchmark queries again with different
configurations of <a href="http://virtuoso.openlinksw.com" id="link-id0x1cca2e00">Virtuoso</a>. I had not studied the details of the
matter previously, but this time I took a closer look at the
queries.</p>
<p>Comparing numbers given by different parties is a constant
problem. In the case reported here, we loaded the full DBpedia 3,
all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6,
all on the same 4 core 2GHz Xeon with 8G RAM. All databases were
striped on 6 disks. The Cluster configuration was with 4 processes
in the same box.</p>
<p>We ran the queries in two variants:</p> 
<ul>
<li>With graph
specified in the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1b77f758">SPARQL</a> <code>FROM</code> clause, using the default indices.</li>
<li>With no graph specified anywhere, using an
alternate indexing scheme.</li>
</ul>
<p>The times below are for the sequence of 5 queries; individual
query times are not reported. I did not do a line-by-line review of
the execution plans since they seem to run well enough. We could
get some extra mileage from cost model tweaks, especially for the
numeric range conditions, but we will do this when somebody comes up
with better times.</p>
<p>First, about Virtuoso v5: Because there is a query in the set that
specifies no condition on S or O and only P, this simply cannot be
done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is
more space efficient.</p>
<p>So we added the index:</p>
<blockquote>
<code>
create bitmap index <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1cb0b180">rdf</a>_quad_pogs on rdf_quad (p, o, g, s);
</code>
</blockquote>

<table>
 <tr>
  <td>&#160;</td>
  <td align="center"><b>Virtuoso v5 with<br /> gspo, ogps, pogs</b>
  </td>
  <td align="center"><b>Virtuoso Cluster v6 with <br />gspo, ogps</b>
  </td>
  <td align="center"><b>Virtuoso Cluster v6 with <br />gspo, ogps, pogs</b>
  </td>
 </tr>
<tr>
  <td><b>cold</b>
  </td>
  <td align="center">210 s</td>
  <td align="center">136 s</td>
  <td align="center">33.4 s</td>
</tr>
<tr>
  <td><b>warm</b>
  </td>
  <td align="center">0.600 s</td>
  <td align="center">4.01 s</td>
  <td align="center">0.628 s</td>
</tr>
</table>
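<p>To put the table in proportion, the ratios between the columns can be worked out directly. The short Python sketch below only restates the measured totals from the table; the dictionary keys and the <code>ratio</code> helper are names made up for this illustration.</p>

```python
# Total times in seconds for the 5-query sequence, copied from the table above.
times = {
    "v5 gspo+ogps+pogs": {"cold": 210.0, "warm": 0.600},
    "v6 cluster gspo+ogps": {"cold": 136.0, "warm": 4.01},
    "v6 cluster gspo+ogps+pogs": {"cold": 33.4, "warm": 0.628},
}

def ratio(slower, faster, cache):
    """How many times longer `slower` takes than `faster` for a cache state."""
    return times[slower][cache] / times[faster][cache]

cold_v5_vs_v6 = ratio("v5 gspo+ogps+pogs", "v6 cluster gspo+ogps+pogs", "cold")
warm_pogs_gain = ratio("v6 cluster gspo+ogps", "v6 cluster gspo+ogps+pogs", "warm")

print(f"cold: v5 takes {cold_v5_vs_v6:.1f}x as long as Cluster v6 with pogs")
print(f"warm: Cluster v6 without pogs takes {warm_pogs_gain:.1f}x as long as with pogs")
```

<p>With these inputs the cold-cache ratio comes out at about 6.3 and the warm-cache gain from adding pogs on Cluster v6 at about 6.4.</p>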

<p>OK, so now let us do it without a graph being specified. For
all platforms, we drop any existing indices, and --</p>
<blockquote>
<code>
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)); <br />
alter index R2 on R2 partition (s int (0hexffff00)); <br />
 <br />
log_enable (2); <br />
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; <br />
 <br />
drop table rdf_quad; <br />
alter table r2 rename RDF_QUAD; <br />
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); <br />
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); <br />
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
</code>
</blockquote>
<p>The code is identical for v5 and v6, except that with v5 we use
<code>iri_id (32 bit)</code> for the type, not <code>iri_id_8 (64 bit)</code>. We note that
we run out of IDs with v5 around a few billion triples, so with v6
we have double the ID length and still manage to be vastly more
space efficient.</p>
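<p>The ID headroom is simple arithmetic. The sketch below assumes, for illustration only, that the full 32 or 64 bits are available as distinct IDs; the actual usable range in Virtuoso may be smaller.</p>

```python
# Number of distinct values representable in 32-bit and 64-bit IDs.
# Assumption: every bit pattern is usable as an ID.
ids_32 = 2 ** 32
ids_64 = 2 ** 64

print(f"32-bit: {ids_32:,} IDs (about {ids_32 / 1e9:.1f} billion)")
print(f"64-bit: {ids_64:,} IDs")
print(f"64-bit offers {ids_64 // ids_32:,} times as many IDs")
```

<p>About 4.3 billion 32-bit IDs is consistent with running out at a few billion triples, since each distinct IRI needs its own ID.</p>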
<p>With the above 4 indices, we can query the <a href="http://dbpedia.org/resource/Data" id="link-id0x6339b80">data</a> pretty much in
any combination without hitting a full scan of any index. We note
that all indices that do not begin with s end with s as a bitmap.
This takes about 60% of the space of a non-bitmap index for data such
as DBpedia.</p>
<p>If you intend to do completely arbitrary RDF queries in
Virtuoso, then chances are you are best off with the above index
scheme.</p>
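<p>The claim about avoiding full scans can be checked mechanically. The Python sketch below is a simplified model, not Virtuoso's optimizer: it assumes an index is usable only when its first key part is bound, and it ignores selectivity. The function names are made up for the illustration.</p>

```python
from itertools import combinations

def usable_prefix(index, bound):
    """Count the leading key parts of `index` that are all in `bound`."""
    n = 0
    for col in index:
        if col not in bound:
            break
        n += 1
    return n

def best_index(indices, bound):
    """Return (index, prefix_length) for the index with the longest bound prefix."""
    return max(((ix, usable_prefix(ix, bound)) for ix in indices),
               key=lambda t: t[1])

def uncovered(indices):
    """Non-empty sets of bound columns for which no index has a bound leading part."""
    result = []
    for r in range(1, 5):
        for combo in combinations("spog", r):
            if best_index(indices, set(combo))[1] == 0:
                result.append("".join(combo))
    return result

DEFAULT = ["gspo", "ogps"]                   # default indices
ALTERNATE = ["spog", "opgs", "pogs", "gpos"]  # the scheme created above

print(uncovered(DEFAULT))    # ['s', 'p', 'sp']
print(uncovered(ALTERNATE))  # []
```

<p>In this model the default indices leave S-only, P-only and S+P lookups without a usable index when no graph is given, while the four-index scheme covers every combination. The model does not capture that a bound G helps little when nearly all the data sits in one graph, which is presumably why the P-only query needed pogs even with the graph specified.</p>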

<table>
 <tr>
  <td>Â </td>
  <td align="center"><b> Virtuoso v5 with<br /> gspo, ogps, pogs</b>
  </td>
  <td align="center"><b> Virtuoso Cluster v6 with <br /> spog, pogs, opgs, gpos </b>
  </td>
 </tr>
<tr>
  <td><b>warm</b>
  </td>
  <td align="center">0.595 s</td>
  <td align="center">0.617 s</td>
</tr>
</table>

<p>The cold times were about the same as above, so not
reproduced.</p>
<h3>Graph or No Graph?</h3>
<p>It is in the SPARQL spirit to specify a graph and for pretty
much any application, there are entirely sensible ways of keeping
the data in graphs and specifying which ones are concerned by
queries. This is why Virtuoso is set up for this by default.</p>
<p>On the other hand, for the open web scenario, dealing with an
unknown large number of graphs, enumerating graphs is not possible
and questions like which graph of which source asserts x become
relevant. We have two distinct use cases which warrant different
setups of the database, simple as that.</p>
<p>The latter use case is not really within the SPARQL spec, so
implementations may or may not support it. For example, <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x11ed7028">Oracle</a> or
Vertica would not do this well, since they partition data according
to graph or predicate, respectively. On the other hand, stores that
work with one quad table, which is most of the ones out there,
should manage it, perhaps with some configuration as shown above.</p>
<p>Frameworks like Jena are not to my <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1a49ded0">knowledge</a> geared towards
having a wildcard for graph, although I would suppose this can be
arranged by adding some &quot;super-graph&quot; object, a graph of all
graphs. I don&#39;t think this is directly supported and besides most
apps would not need it.</p>
<p>Once the indices are right, there is no difference between
specifying a graph and not specifying a graph with the queries considered. With
more complex queries, specifying a graph or set of graphs does
allow some optimizations that cannot be done with no graph specified.
For example, bitmap intersections are possible only when all
leading key parts are given.</p>
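<p>To illustrate why the leading key parts matter, here is a minimal Python sketch of a bitmap intersection. It is not Virtuoso's implementation: plain integers stand in for bitmaps, and the index contents, predicate names and subject IDs are invented. Each fully specified (p, o, g) key leads to one bitmap of subjects, and two such bitmaps can be ANDed directly.</p>

```python
def to_bitmap(subject_ids):
    """Pack small integer subject IDs into one int used as a bitmap."""
    bits = 0
    for s in subject_ids:
        bits |= 1 << s
    return bits

def from_bitmap(bits):
    """List the subject IDs whose bit is set, in ascending order."""
    ids = []
    pos = 0
    while bits:
        if bits & 1:
            ids.append(pos)
        bits >>= 1
        pos += 1
    return ids

# Hypothetical pogs-style index: (p, o, g) -> bitmap of subjects s.
pogs = {
    ("type", "Professor", "g1"): to_bitmap([3, 5, 8, 13]),
    ("degreeFrom", "Univ0", "g1"): to_bitmap([5, 6, 13, 21]),
}

def subjects_matching(index, *keys):
    """AND together the subject bitmaps of fully specified (p, o, g) keys."""
    result = None
    for key in keys:
        bits = index.get(key, 0)
        result = bits if result is None else result & bits
    return from_bitmap(result or 0)

print(subjects_matching(pogs,
                        ("type", "Professor", "g1"),
                        ("degreeFrom", "Univ0", "g1")))  # [5, 13]
```

<p>If g were left open, each condition would match many keys and hence many bitmaps, which would first have to be merged before any intersection could take place.</p>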
<h3>Conclusions</h3>
<p>The best warm cache time is with v5; the five queries run in under
600 ms after the first go. This is noted to show that all-in-memory with
a single thread of execution is hard to beat.</p>
<p>Cluster v6 performs the same queries in 623 ms. What is gained in
parallelism is lost in latency if all operations complete in
microseconds. On the other hand, Cluster v6 leaves v5 in the dust in
any situation that has less than 100% hit rate. This is due to
actual benefit from parallelism if operations take longer than a
few microseconds, such as in the case of disk reads. Cluster v6 has
substantially better data layout on disk, as well as fewer pages to
load for the same content.</p>
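<p>The trade-off between parallelism and latency can be pictured with a toy cost model. The Python sketch below is not a description of how Virtuoso schedules work: the fixed per-operation overhead of 3 microseconds, the operation times and the function names are all invented for the example. Only the figure of 4 processes comes from the setup described above.</p>

```python
def single_time(n_ops, op_us):
    """Total microseconds for n_ops operations run one after another."""
    return n_ops * op_us

def cluster_time(n_ops, op_us, overhead_us, k):
    """Total microseconds when k processes share the work evenly,
    each operation paying an extra fixed messaging overhead."""
    return n_ops * (op_us + overhead_us) / k

def break_even_op_us(overhead_us, k):
    """Operation time at which both setups take equally long:
    op = (op + overhead) / k  =>  op = overhead / (k - 1)."""
    return overhead_us / (k - 1)

K = 4               # processes, as in the cluster setup above
OVERHEAD_US = 3.0   # hypothetical per-operation overhead

for label, op_us in [("in-memory lookup", 1.0), ("disk read", 5000.0)]:
    s = single_time(1000, op_us)
    c = cluster_time(1000, op_us, OVERHEAD_US, K)
    print(f"{label}: single {s:.0f} us, cluster {c:.0f} us")

print("break-even operation time:", break_even_op_us(OVERHEAD_US, K), "us")
```

<p>With these made-up numbers the two setups tie when every operation takes about a microsecond, and the cluster is close to four times faster once operations take milliseconds, which matches the shape of the warm and cold results.</p>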
<p>This makes it possible to run the queries without the pogs
index on Cluster v6 even when v5 takes prohibitively long.</p>
<p>The moral of the story is to have a lot of RAM and space-efficient data representation.</p>
<p>The DBpedia benchmark does not specify any random access
pattern that would give a measure of sustained throughput under
load, so we are left with the extremes of cold and warm cache of
which neither is quite realistic.</p>
<p>Chris Bizer and I have talked on and off about benchmarks and
I have made suggestions that we will see incorporated into the
Berlin SPARQL benchmark, which will, I believe, be much more
informative.</p>
<h3>Appendix: Query Text</h3>
<p>For reference, the query texts specifying the graph are below. To
run without specifying the graph, just drop the <code>FROM
&lt;<a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1905bfd0">http</a>://dbpedia.org&gt;</code> from each query. The returned row counts are indicated
below each query&#39;s text.</p>
<blockquote>
 <code><pre>
sparql SELECT ?p ?o FROM &lt;http://dbpedia.org&gt; WHERE {
  &lt;http://dbpedia.org/resource/Metropolitan_Museum_of_Art&gt; ?p ?o };

-- 1337 rows

sparql PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM &lt;http://dbpedia.org&gt; WHERE {
  ?film1 p:starring &lt;http://dbpedia.org/resource/Kevin_Bacon&gt; .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

-- 23910 rows

sparql PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?artist ?artwork ?museum ?director FROM &lt;http://dbpedia.org&gt; 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
SELECT ?s ?homepage FROM &lt;http://dbpedia.org&gt;  WHERE {
   &lt;http://dbpedia.org/resource/Berlin&gt; geo:lat ?berlinLat .
   &lt;http://dbpedia.org/resource/Berlin&gt; geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        &lt;=     ?berlinLat + 0.03190235436 &amp;&amp;
     ?long       &gt;=     ?berlinLong - 0.08679199218 &amp;&amp;
     ?lat        &gt;=     ?berlinLat - 0.03190235436 &amp;&amp; 
     ?long       &lt;=     ?berlinLong + 0.08679199218) };

-- 56 rows

sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?s ?a ?homepage FROM &lt;http://dbpedia.org&gt;  WHERE {
   &lt;http://dbpedia.org/resource/New_York_City&gt; geo:lat ?nyLat .
   &lt;http://dbpedia.org/resource/New_York_City&gt; geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        &lt;=     ?nyLat + 0.3190235436 &amp;&amp;
     ?long       &gt;=     ?nyLong - 0.8679199218 &amp;&amp;
     ?lat        &gt;=     ?nyLat - 0.3190235436 &amp;&amp; 
     ?long       &lt;=     ?nyLong + 0.8679199218) };

-- 13 rows
</pre>
 </code>
</blockquote>
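<p>The no-graph variants of the queries above can be produced mechanically. The Python sketch below is one possible way to strip the clause. The <code>drop_from</code> helper is a name made up here, and the regular expression is a simplification: it leaves <code>FROM NAMED</code> alone and would also match the same text inside a string literal.</p>

```python
import re

# Matches "FROM <iri>" together with the whitespace before it.
FROM_CLAUSE = re.compile(r"\s*\bFROM\s+<[^>]*>", re.IGNORECASE)

def drop_from(query):
    """Remove every FROM <iri> clause from a SPARQL query string."""
    return FROM_CLAUSE.sub("", query)

q = ("SELECT ?p ?o FROM <http://dbpedia.org> WHERE { "
     "<http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o }")

print(drop_from(q))
```

<p>The IRIs inside the <code>WHERE</code> block are untouched, since the pattern only matches an IRI that directly follows the <code>FROM</code> keyword.</p>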
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-03-06#1321">
  <rss:title>TPC H as Linked Data (Updated 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-03-06T16:22:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have a new demo online at http://demo.openlinksw.com/tpc-h. This takes the industry standard TPC-H benchmark data and presents it as linked data with a SPARQL end point and dereferenceable URIs. This is an example of using Virtuoso&#39;s relational-to-RDF mapping for publishing business data, for browsing using the linked data principles and opening it to analytics queries in SPARQL. As noted before, we have extended SPARQL with aggregation and nested queries, thus making it a viable SQL substitute for decision support queries. The article at http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData gives details and the source code for the implementation. We are still working on some aspects of the more complex TPC-H queries, thus the demo is not complete with all the 22 queries. This is however enough to see a representative sample of how analytics queries work with SPARQL and Virtuoso&#39;s SQL-to-RDF mapping. The demo will be part of the next Virtuoso Open Source download, probably out next week.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have a new demo online at <a href="http://demo.openlinksw.com/tpc-h" id="link-id1829c9a0">http://demo.openlinksw.com/tpc-h</a>. This takes the industry standard <a href="http://dbpedia.org/resource/TPC-H" id="link-id0xeb7e460">TPC-H</a> benchmark <a href="http://dbpedia.org/resource/Data" id="link-id0xb40fcb8">data</a> and presents it as <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x9edbd128">linked data</a> with a <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xf566a50">SPARQL</a> end point and dereferenceable URIs. </p>
<p>This is an example of using <a href="http://virtuoso.openlinksw.com" id="link-id0x11e59f80">Virtuoso</a>&#39;s relational-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xfc93c70">RDF</a> mapping for publishing business data, for browsing using the linked data principles and opening it to analytics queries in SPARQL.</p>
<p> As noted before, we have extended SPARQL with aggregation and nested queries, thus making it a viable <a href="http://dbpedia.org/resource/SQL" id="link-id0xffe4520">SQL</a> substitute for decision support queries. </p>
<p>The article at <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData" id="link-id10799d10">http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData</a> gives details and the source code for the implementation.</p>
<p> We are still working on some aspects of the more complex TPC-H queries, thus the demo is not complete with all the 22 queries. This is however enough to see a representative sample of how analytics queries work with SPARQL and Virtuoso&#39;s SQL-to-RDF mapping. The demo will be part of the next Virtuoso Open Source download, probably out next week.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-02-04#1308">
  <rss:title>LUBM results with Virtuoso 6.0</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-02-04T09:58:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have now run the LUBM benchmark on Virtuoso v6, with the same configuration as discussed last Friday. We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities — same data but different sizes of working set. 100 universities: 35.3 qps 1000 universities: 26.3 qps 8000 universities: 13.1 qps The 100 universities slice is about the same as with v5.0.5 (35.3 vs 33.1 qps). The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps). This comes from the fact that the v6 database takes half of the space of the v5.0.5 one.  Further, this is with 64-bit IDs for everything.  If the 5.5 database were with 64-bit IDs, we&#39;d have a difference of over 3x.  This is worth something if it lets you get by with only 1 terabyte of RAM for the 100 billion  triple application, instead of 3 TB. In a few more days, we&#39;ll give the results for Virtuoso v6 Cluster.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have now run the LUBM benchmark on <a href="http://virtuoso.openlinksw.com" id="link-id0x1a6cb3c8">Virtuoso</a> v6, with the same configuration <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1302" id="link-id107f0238">as discussed last Friday</a>.</p>
<p>We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities — same <a href="http://dbpedia.org/resource/Data" id="link-id0x12ac6cc8">data</a> but different sizes of working set.</p>
<blockquote>
<pre>
 100 universities: 35.3 qps
1000 universities: 26.3 qps
8000 universities: 13.1 qps</pre></blockquote>
<p>The 100 universities slice is about the same as with v5.0.5 (35.3 vs 33.1 qps). <br />The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps).</p>
<p>This comes from the fact that the v6 database takes half of the space of the v5.0.5 one. Further, this is with 64-bit IDs for everything. If the v5.0.5 database were with 64-bit IDs, we&#39;d have a difference of over 3x. This is worth something if it lets you get by with only 1 terabyte of RAM for the 100 billion triple application, instead of 3 TB.</p>
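<p>The figures in the two paragraphs above translate into per-triple terms as follows. This Python sketch assumes decimal terabytes, which the text does not specify, and the variable names are made up for the illustration.</p>

```python
TB = 10 ** 12            # decimal terabyte (assumption)
triples = 100 * 10 ** 9  # the 100 billion triple application

bytes_per_triple_1tb = 1 * TB / triples  # RAM budget per triple with 1 TB
bytes_per_triple_3tb = 3 * TB / triples  # RAM budget per triple with 3 TB
speedup_8000 = 13.1 / 4.8                # v6 vs v5.0.5 on the 8000-university set

print(f"1 TB: {bytes_per_triple_1tb:.0f} bytes per triple")
print(f"3 TB: {bytes_per_triple_3tb:.0f} bytes per triple")
print(f"8000 universities: {speedup_8000:.2f}x the throughput")
```

<p>That is roughly 10 versus 30 bytes of RAM per triple, and a throughput ratio of about 2.7 on the largest working set.</p>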
<p>
<a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1358" id="link-id15fb4d38">In a few more days</a>, we&#39;ll give the results for Virtuoso v6 Cluster.</p>

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1304">
  <rss:title>Latest LUBM Benchmark results for Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-02-01T14:39:04Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have now taken a close look at the query side of the LUBM benchmark, as promised a couple of blog posts ago. We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database. When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130. We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second. The article LUBM and Virtuoso gives the details. In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things. Now, what has been said about RDF benchmarking previously still holds. What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF data? Or about information integration? Not very much. But then this was not the aim of the authors either. So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there. In the immediate future, we have the general availability of Virtuoso Open Source 5.0.5 early next week. 
This comes with a LUBM test driver and a test suite running against the LUBM qualification database. After this we will give some numbers for the cluster edition with LUBM and TPC-H.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have now taken a close look at the query side of the LUBM benchmark, <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1296" id="link-id10a98120">as promised a couple of blog posts ago.</a>
</p>
<p>We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.</p>
<p>When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130.</p>
<p>We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second.</p>
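<p>The rates quoted above can be broken down per query and per core. The Python sketch below only rearranges the numbers from the text; the variable names are made up. The comparison with the raw lookup rate is indicative only, because the query mix also includes scans and bitmap intersections.</p>

```python
qps = 33                           # queries per second, all in memory
triples_per_second = 330_000       # triples retrieved per second at that rate
cores = 4
random_lookups_per_core = 250_000  # raw lookups per core per second

triples_per_query = triples_per_second / qps
triples_per_core = triples_per_second / cores
share_of_lookup_rate = triples_per_core / random_lookups_per_core

print(f"{triples_per_query:.0f} triples per query on average")
print(f"{triples_per_core:.0f} triples per core per second")
print(f"{share_of_lookup_rate:.0%} of the raw random lookup rate")
```

<p>So an average query in the mix touches about 10000 triples, and each core retrieves about a third as many triples per second as it could with bare random lookups.</p>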
<p>The article <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSArticleLUBMBenchmark" id="link-id10237708">LUBM and Virtuoso</a> gives the details.</p>
<p>In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.</p>
<p>Now, what has been said about <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x104257c0">RDF</a> benchmarking previously still holds. What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF <a href="http://dbpedia.org/resource/Data" id="link-id0x7376478">data</a>? Or about <a href="http://dbpedia.org/resource/Information" id="link-id0x13fd3f30">information</a> integration? Not very much. But then this was not the aim of the authors either.</p>
<p>So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.</p>
<p>In the immediate future, we have the general availability of <a href="http://virtuoso.openlinksw.com" id="link-id0x193509e8">Virtuoso</a> Open Source 5.0.5 early next week. This comes with a LUBM test driver and a test suite running against the LUBM qualification database.</p>
<p>After this we will give some numbers for the cluster edition with LUBM and <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1b8d1348">TPC-H</a>.</p>
]]></content:encoded>
 </rss:item>
</rdf:RDF>