Transaction Semantics in RDF and Relational Models [ Orri Erling ]

As part of defining a benchmark audit for testing ACID properties on RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and their implications for locking (pessimistic) and multi-version (optimistic) concurrency control schemes.

In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.

Insert (Create) and Delete

INSERT and DELETE as defined in SPARQL are queries that generate a result set which is then used to instantiate triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the

DELETE FROM table WHERE key IN 
   ( SELECT key1 FROM other_table )

expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.

The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.

Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
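The two serializable outcomes can be sketched in a few lines. The `TripleStore` class and its methods below are purely illustrative, not any real store's API:

```python
# Hypothetical in-memory triple store illustrating the two serializable
# outcomes of a concurrent DELETE { 1 ?x ?y } and INSERT of three triples.
class TripleStore:
    def __init__(self, triples):
        self.triples = set(triples)

    def delete_pattern(self, s):
        # DELETE { s ?x ?y }: remove every triple with the given subject
        self.triples = {t for t in self.triples if t[0] != s}

    def insert(self, triples):
        self.triples |= set(triples)

initial = {(1, 0, 0), (1, 5, 6), (1, 5, 7)}
inserted = [(1, 2, 4), (1, 2, 3), (1, 3, 5)]

# Order 1: DELETE serialized first, then INSERT ->
# only the inserted triples remain.
s1 = TripleStore(initial)
s1.delete_pattern(1)
s1.insert(inserted)
assert s1.triples == set(inserted)

# Order 2: INSERT serialized first, then DELETE ->
# no triples with subject 1 remain.
s2 = TripleStore(initial)
s2.insert(inserted)
s2.delete_pattern(1)
assert s2.triples == set()
```

Any interleaving other than these two end states would violate the DELETE's semantics.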

Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.

We would not get the desired end-state of either all the inserted triples or no triples with subject 1 if the read or the DELETE were not serializable.

Furthermore, if a DELETE template produces a triple that did not exist in the pre-image, the DELETE semantics still imply that this triple also does not exist in the after-image, which implies serializability.

Read and Update

Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.

The initial state is

a  balance  10
b  balance  10

We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.

A relational database needs REPEATABLE READ isolation for this.

With RDF, txn1 reads that a has a balance of 10. At the same time, txn2 reads the balance of a. txn2 waits because the read of txn1 is exclusive. txn1 proceeds and reads the balance of b. It then updates the balances of a and b.

All this proceeds without the deadlock that is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.

We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in an RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not the more general case.

Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.
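The anomaly can be made concrete with a small sketch. The data structures and the `set_balance` helper are hypothetical, for illustration only:

```python
# Sketch: updating a balance in a triple store is a DELETE plus an
# INSERT, because the object value is part of the primary key.
triples = {("a", "balance", 10), ("b", "balance", 10)}

def set_balance(triples, subj, new_value):
    # No update-in-place: remove the old balance triple, add a new one.
    old = {t for t in triples if t[0] == subj and t[1] == "balance"}
    return (triples - old) | {(subj, "balance", new_value)}

# txn1: transfer 1 from a to b, then commit.
triples = set_balance(triples, "a", 9)
triples = set_balance(triples, "b", 11)

# txn2 was waiting on the row ("a", "balance", 10); after txn1 commits,
# that row no longer exists.  A lock on the vanished row alone does not
# lead txn2 to the new row ("a", "balance", 9): only a SERIALIZABLE read,
# which locks the gap where { a balance ?x } can match, forces txn2 to
# wait for and then observe txn1's inserted value.
assert ("a", "balance", 10) not in triples
assert ("a", "balance", 9) in triples
```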

Locking Order and OLTP

To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, though they become rarer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
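Deterministic lock ordering can be sketched as follows. The lock table and the `acct:` key names are illustrative, not any specific DBMS API:

```python
import threading

# Sketch: acquiring row locks in a deterministic (sorted-key) order so
# that two concurrent transactions cannot deadlock.
locks = {key: threading.Lock() for key in ("acct:a", "acct:b")}
completed = []

def transfer(keys):
    held = []
    for key in sorted(keys):          # always lock the lowest key first
        locks[key].acquire()
        held.append(key)
    try:
        completed.append(keys)        # ... read and write the rows ...
    finally:
        for key in reversed(held):
            locks[key].release()

# Two transactions touching the same rows in opposite logical order
# still physically lock acct:a before acct:b, so neither can wait on a
# lock the other holds while itself holding one the other needs.
t1 = threading.Thread(target=transfer, args=(["acct:a", "acct:b"],))
t2 = threading.Thread(target=transfer, args=(["acct:b", "acct:a"],))
t1.start(); t2.start(); t1.join(); t2.join()
assert len(completed) == 2
```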

This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.

RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system, and applying row-level locking to this does not give the semantics one expects of a relational row.

I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. They need not apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.

It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.

Pessimistic vs. Optimistic Concurrency Control

We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.

To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
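The predecessor-locking idea can be sketched as follows. The `range_read` and `try_insert` functions and the gap-lock representation are hypothetical simplifications of what a real lock manager does:

```python
import bisect

# Sketch of "locking the row before the selected range": a range read
# also locks the greatest key below the range, and an insert is refused
# if it would land to the right of a gap-locked key.
keys = [10, 20, 30, 50]
gap_locks = set()  # keys whose right-hand gap is locked

def range_read(lo, hi):
    i = bisect.bisect_left(keys, lo)
    # lock the predecessor key (None stands for "before the first key")
    gap_locks.add(keys[i - 1] if i > 0 else None)
    for k in keys[i:]:
        if k >= hi:
            break
        gap_locks.add(k)
    return [k for k in keys if lo <= k < hi]

def try_insert(k):
    i = bisect.bisect_left(keys, k)
    pred = keys[i - 1] if i > 0 else None
    if pred in gap_locks:
        return False   # would appear inside a serializable read's range
    bisect.insort(keys, k)
    return True

assert range_read(20, 40) == [20, 30]
assert try_insert(25) is False   # falls after gap-locked key 20
assert try_insert(60) is True    # outside the locked range
```

A repeated COUNT over the range [20, 40) is thus guaranteed to see the same two rows, which is exactly the serializability guarantee described above.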

Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.

The issue of optimistic vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.

Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.

In RDF, the same principles apply, but distinguishing the cardinality of an updated set will have to rely on statistics of predicate cardinality. Such statistics are needed for query optimization anyway.

Eventual Consistency

Web scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase commit becomes very inefficient as latency increases; thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.

Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.

Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed quads (i.e., DELETEd and INSERTed quads, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.
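Comparing per-quad logs can be sketched as below. The log format and the last-writer-wins rule are assumptions for illustration, in the spirit of the Dynamo/PNUTS record histories mentioned above:

```python
# Sketch: reconciling two diverged sites by comparing time-stamped
# per-quad operation logs.  Each entry is (quad, operation, timestamp).
log_site1 = [(("a", "status", "active", "g"), "INSERT", 100)]
log_site2 = [(("a", "status", "active", "g"), "DELETE", 120)]

def reconcile(*logs):
    latest = {}
    for log in logs:
        for quad, op, ts in log:
            # keep only the most recent recorded operation per quad
            if quad not in latest or ts > latest[quad][1]:
                latest[quad] = (op, ts)
    # a quad survives only if its latest operation is an INSERT
    return {q for q, (op, _) in latest.items() if op == "INSERT"}

# Site 2's later DELETE wins, so the quad is absent from the merge.
assert reconcile(log_site1, log_site2) == set()
```

A real system would hand conflicting histories to application logic instead of applying a fixed rule, but the log-comparison mechanics are the same.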

As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.

Conclusions

We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.

Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.

# PermaLink Comments [0]
03/22/2011 19:55 GMT Modified: 03/22/2011 18:24 GMT
RDF and Transactions [ Orri Erling ]

I will here talk about RDF and transactions for developers in general. The next post goes into specifics and is aimed at specialists.

Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?

There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.

At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.

If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.

There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).

If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.

There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.

Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.

Web people argue with some justification that since the four cardinal virtues of databases never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.

Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.

As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.

As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.

This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.

As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.

Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.

With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.

If all users lock the resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, as long as each is under a certain size (normally 10,000 quads), is guaranteed never to fail due to locking. These could still fail by running out of space, though. With previous versions, there was always a possibility of an INSERT or DELETE failing because of deadlock when there were multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock-free, since there the primary transaction is the INSERT or DELETE of a small graph.
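The ordered-locking idea can be shown in a few lines. This is not Virtuoso code, just an illustration of why acquiring locks in a canonical key order makes wait-for cycles, and hence deadlocks, impossible:

```python
import threading

# One lock per resource key; acquiring them in sorted key order means no
# two transactions can ever wait on each other in a cycle.
locks = {k: threading.Lock() for k in range(5)}

def locked_update(keys, log, name):
    held = sorted(set(keys))           # canonical acquisition order
    for k in held:
        locks[k].acquire()
    try:
        log.append(name)               # the "update" itself
    finally:
        for k in reversed(held):
            locks[k].release()

log = []
t1 = threading.Thread(target=locked_update, args=([3, 1, 2], log, "tx1"))
t2 = threading.Thread(target=locked_update, args=([2, 4, 1], log, "tx2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(log))  # -> ['tx1', 'tx2']
```

Without the `sorted()`, tx1 could take lock 3 while tx2 takes lock 2, and each could then wait forever on the other.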

Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of lock contention. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary-size operations; thus concurrent operations might still sometimes deadlock. Moreover, vectored execution as introduced in Virtuoso 7, besides easily doubling random-access speed, also greatly reduces deadlocks by virtue of ordering operations.

In the next post we will talk about what transactions mean with RDF and whether there is any difference from the relational model.

# PermaLink Comments [0]
03/22/2011 18:52 GMT Modified: 03/22/2011 17:44 GMT
Benchmarks, Redux (part 14): BSBM BI Mix [ Orri Erling ]

In this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post.

Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones, because we will most likely change the composition of the workload; it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so those interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted.

Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.

Below we show samples of test driver output; the whole output is downloadable.

100 Mt Single User

bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt \
   -dg http://bsbm.org http://localhost:8604/sparql
0: 43348.14ms, total: 43440ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    43.3481s / 43.3481s
Elapsed runtime:        43.348 seconds
QMpH:                   83.049 query mixes per hour
CQET:                   43.348 seconds average runtime of query mix
CQET (geom.):           43.348 seconds geometric mean runtime of query mix
AQET (geom.):           0.492 seconds geometric mean runtime of query
Throughput:             1494.874 BSBM-BI throughput: qph*scale
BI Power:               7309.820 BSBM-BI Power: qph*scale (geom)

100 Mt 8 User

Thread 6: query mix 3: 195793.09ms, total: 196086.18ms
Thread 8: query mix 0: 197843.84ms, total: 198010.50ms
Thread 7: query mix 4: 201806.28ms, total: 201996.26ms
Thread 2: query mix 5: 221983.93ms, total: 222105.96ms
Thread 4: query mix 7: 225127.55ms, total: 225317.49ms
Thread 3: query mix 6: 225860.49ms, total: 226050.17ms
Thread 5: query mix 2: 230884.93ms, total: 231067.61ms
Thread 1: query mix 1: 237836.61ms, total: 237959.11ms
Benchmark run completed in 237.985427s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    195.7931s / 237.8366s
Total runtime (sum):    1737.137 seconds
Elapsed runtime:        1737.137 seconds
QMpH:                   121.016 query mixes per hour
CQET:                   217.142 seconds average runtime of query mix
CQET (geom.):           216.603 seconds geometric mean runtime of query mix
AQET (geom.):           2.156 seconds geometric mean runtime of query
Throughput:             2178.285 BSBM-BI throughput: qph*scale
BI Power:               1669.745 BSBM-BI Power: qph*scale (geom)

1000 Mt Single User

0: 608707.03ms, total: 608768ms

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    608.7070s / 608.7070s
Elapsed runtime:        608.707 seconds
QMpH:                   5.914 query mixes per hour
CQET:                   608.707 seconds average runtime of query mix
CQET (geom.):           608.707 seconds geometric mean runtime of query mix
AQET (geom.):           5.167 seconds geometric mean runtime of query
Throughput:             1064.552 BSBM-BI throughput: qph*scale
BI Power:               6967.325 BSBM-BI Power: qph*scale (geom)

1000 Mt 8 User

bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt \
   -dg http://bsbm.org http://localhost:8604/sparql
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    2211.2753s / 2889.1992s
Total runtime (sum):    20481.895 seconds
Elapsed runtime:        20481.895 seconds
QMpH:                   9.968 query mixes per hour
CQET:                   2560.237 seconds average runtime of query mix
CQET (geom.):           2544.284 seconds geometric mean runtime of query mix
AQET (geom.):           13.556 seconds geometric mean runtime of query
Throughput:             1794.205 BSBM-BI throughput: qph*scale
BI Power:               2655.678 BSBM-BI Power: qph*scale (geom)

Metrics for Query:      1
Count:                  8 times executed in whole run
Time share              2.120884% of total execution time
AQET:                   54.299656 seconds (arithmetic mean)
AQET(geom.):            34.607302 seconds (geometric mean)
QPS:                    0.13 Queries per second
minQET/maxQET:          11.71547600s / 148.65379700s

Metrics for Query:      2
Count:                  8 times executed in whole run
Time share              0.207382% of total execution time
AQET:                   5.309462 seconds (arithmetic mean)
AQET(geom.):            2.737696 seconds (geometric mean)
QPS:                    1.34 Queries per second
minQET/maxQET:          0.78729800s / 25.80948200s

Metrics for Query:      3
Count:                  8 times executed in whole run
Time share              17.650472% of total execution time
AQET:                   451.893890 seconds (arithmetic mean)
AQET(geom.):            410.481088 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          171.07262500s / 721.72939200s

Metrics for Query:      5
Count:                  32 times executed in whole run
Time share              6.196565% of total execution time
AQET:                   39.661685 seconds (arithmetic mean)
AQET(geom.):            6.849882 seconds (geometric mean)
QPS:                    0.18 Queries per second
minQET/maxQET:          0.15696500s / 189.00906200s

Metrics for Query:      6
Count:                  8 times executed in whole run
Time share              0.119916% of total execution time
AQET:                   3.070136 seconds (arithmetic mean)
AQET(geom.):            2.056059 seconds (geometric mean)
QPS:                    2.31 Queries per second
minQET/maxQET:          0.41524400s / 7.55655300s

Metrics for Query:      7
Count:                  40 times executed in whole run
Time share              1.577963% of total execution time
AQET:                   8.079921 seconds (arithmetic mean)
AQET(geom.):            1.342079 seconds (geometric mean)
QPS:                    0.88 Queries per second
minQET/maxQET:          0.02205800s / 40.27761500s

Metrics for Query:      8
Count:                  40 times executed in whole run
Time share              72.126818% of total execution time
AQET:                   369.323481 seconds (arithmetic mean)
AQET(geom.):            114.431863 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          5.94377300s / 1824.57867400s

The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.

With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit. If the single-user run were at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparison of the same numbers with the column store will be interesting, since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good.

For reference, we show a run with Virtuoso 6 at 100 Mt.

0: 424754.40ms, total: 424829ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    424.7544s / 424.7544s
Elapsed runtime:        424.754 seconds
QMpH:                   8.475 query mixes per hour
CQET:                   424.754 seconds average runtime of query mix
CQET (geom.):           424.754 seconds geometric mean runtime of query mix
AQET (geom.):           1.097 seconds geometric mean runtime of query
Throughput:             152.559 BSBM-BI throughput: qph*scale
BI Power:               3281.150 BSBM-BI Power: qph*scale (geom)

and 8 user

Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    616.9979s / 701.8247s
Total runtime (sum):    5237.127 seconds
Elapsed runtime:        5237.127 seconds
QMpH:                   41.031 query mixes per hour
CQET:                   654.641 seconds average runtime of query mix
CQET (geom.):           653.873 seconds geometric mean runtime of query mix
AQET (geom.):           2.557 seconds geometric mean runtime of query
Throughput:             738.557 BSBM-BI throughput: qph*scale
BI Power:               1408.133 BSBM-BI Power: qph*scale (geom)

Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput.

 100 Mt: sqrt ( 7771 * 2178 ) = 4114

1000 Mt: sqrt ( 6967 * 1794 ) = 3535
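The composite score is simply the geometric mean of single-user Power and multi-user Throughput; a quick check using the figures quoted here:

```python
from math import sqrt

def composite(power, throughput):
    """Geometric mean of BSBM-BI Power and Throughput."""
    return sqrt(power * throughput)

print(round(composite(7771, 2178)))  # 100 Mt  -> 4114
print(round(composite(6967, 1794)))  # 1000 Mt -> 3535
```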

Scaling seems to work; the results are in the same general ballpark. The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable.

Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.

Also there should be more queries.

At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. Recall that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at a benchmark with long-running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people.

Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an IN into a JOIN with the IN subquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN which besides must be partitioned at the larger scales.

Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.

In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.

BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonable JOIN order since we always reorder the query regardless of the source formulation.
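As a toy illustration of the estimate involved, hypothetical pattern cardinalities (the numbers below are invented, not from BSBM) can drive JOIN order with nothing more than a sort:

```python
# Hypothetical cardinality estimates for triple patterns (invented numbers).
# A simple greedy optimizer just starts from the most selective pattern.
patterns = [
    ("?p rdf:type bsbm:Product", 284_826),
    ("?p bsbm:producer ?prod",   284_826),
    ("?prod bsbm:country :DE",        70),
    ("?r bsbm:reviewFor ?p",   1_424_130),
]

def join_order(patterns):
    """Order patterns by estimated cardinality, smallest first.
    It is enough to distinguish smaller from greater; the estimates
    need not otherwise be accurate."""
    return sorted(patterns, key=lambda p: p[1])

for pat, card in join_order(patterns):
    print(card, pat)
```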

BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though.

There are places in BSBM-BI where using a HASH JOIN makes sense. We do not use HASH JOINs with RDF because there is an index for everything, and a HASH JOIN in the wrong place can have a large up-front cost; one is more robust against cost-model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like "count the hairballs made in Germany reviewed by Japanese Hello Kitty fans," where the two ends of a JOIN path are fairly selective, doing one end as a HASH JOIN is good. This can, if the JOIN is always cardinality-reducing, even be merged inside an INDEX lookup. We have such capabilities since we have been gearing up for the relational races for a while, but are not using any of these with BSBM-BI, although they would be useful.
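A hash join in its simplest form builds a hash table on the smaller input up front and then streams the larger input through it. A minimal sketch, not Virtuoso's implementation, with invented toy data:

```python
def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on the (smaller) build side, then stream the
    probe side through it: one up-front cost, then O(1) per probe row."""
    table = {}
    for row in build:
        table.setdefault(build_key(row), []).append(row)
    for row in probe:
        for match in table.get(probe_key(row), []):
            yield (match, row)

# Toy data: (product_id, country) and (review_id, product_id).
products = [(1, "DE"), (2, "JP"), (3, "DE")]
reviews = [(10, 1), (11, 3), (12, 2), (13, 1)]

german = [p for p in products if p[1] == "DE"]   # the selective build side
joined = list(hash_join(german, reviews,
                        build_key=lambda p: p[0],
                        probe_key=lambda r: r[1]))
print(len(joined))  # -> 3  (reviews 10, 11, 13 hit products 1 and 3)
```

The up-front cost of building the table is exactly what makes a misplaced hash join expensive: if the build side turns out large, the cost is paid before a single result is produced.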

Let us see the profile for a single user 100 Mt run.

The database activity summary is --

select db_activity (0, 'http');

161.3M rnd  210.2M seq      0 same seg   104.5M same pg  45.08M same par      0 disk      0 spec disk      0B /      0 messages  2.393K fork

See the post "What Does BSBM Explore Measure" for an explanation of the numbers. We see that there is more sequential access than random, and the random access has fair locality, with over half of the hits on the same page as the previous one and much of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient, but it is not being used here. So we manually set the vector size to 100,000 instead of the default 10,000. We get --

172.4M rnd  220.8M seq      0 same seg   149.6M same pg  10.99M same par     21 disk    861 spec disk      0B /      0 messages     754 fork

The throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default, since it raises the cost of small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. Once we correctly tune the self-adjust logic, we will again have clear gains.

Let us now go back to the first run with vector size 10000.

The top of the CPU oprofile is as follows:

722309   15.4507  cmpf_iri64n_iri64n
434791    9.3005  cmpf_iri64n_iri64n_anyn_iri64n
294712    6.3041  itc_next_set
273488    5.8501  itc_vec_split_search
203970    4.3631  itc_dive_transit
199687    4.2714  itc_page_rcf_search
181614    3.8848  dc_itc_append_any
173043    3.7015  itc_bm_vec_row_check
146727    3.1386  cmpf_int64n
128224    2.7428  itc_vec_row_check
113515    2.4282  dk_alloc
97296     2.0812  page_wait_access
62523     1.3374  qst_vec_get_int64
59014     1.2623  itc_next_set_parent
53589     1.1463  sslr_qst_get
48003     1.0268  ds_add
46641     0.9977  dk_free_tree
44551     0.9530  kc_var_col
43650     0.9337  page_col_cmp_1
35297     0.7550  cmpf_iri64n_iri64n_anyn_gt_lt
34589     0.7399  dv_compare
25864     0.5532  cmpf_iri64n_anyn_iri64n_iri64n_lte
23088     0.4939  dk_free

The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with P and S given. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.
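Streaming aggregation exploits input that already arrives ordered on the grouping columns, so each group can be finished and emitted as soon as its last row is seen, with no hash table of groups. A sketch with `itertools.groupby`:

```python
from itertools import groupby

# Input already sorted on the grouping key (e.g. produced by an index scan
# whose leading key is the grouping column).
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]

def streaming_sum(rows):
    """Aggregate one group at a time; memory stays constant regardless
    of the number of groups, unlike a hash-based GROUP BY."""
    for key, group in groupby(rows, key=lambda r: r[0]):
        yield key, sum(v for _, v in group)

print(list(streaming_sum(rows)))  # -> [('a', 3), ('b', 5), ('c', 7)]
```

This is precisely the situation where a very large number of groups is cheap, which is why a query exercising it would be a useful addition to the mix.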

A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set. Some code sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good. TPC-H has such in Q21 and Q19. An OR with existences where there would be gain from good guesses of a subquery's selectivity would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like the CASE-WHEN in TPC-H Q12.

We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more.

This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.

Benchmarks, Redux Series

# PermaLink Comments [0]
03/22/2011 18:31 GMT Modified: 03/22/2011 17:04 GMT
   -ucf bsbm/usecases/businessIntelligence/sparql.txt   \
   -dg http://bsbm.org http://localhost:8604/sparql
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    2211.2753s / 2889.1992s
Total runtime (sum):    20481.895 seconds
Elapsed runtime:        20481.895 seconds
QMpH:                   9.968 query mixes per hour
CQET:                   2560.237 seconds average runtime of query mix
CQET (geom.):           2544.284 seconds geometric mean runtime of query mix
AQET (geom.):           13.556 seconds geometric mean runtime of query
Throughput:             1794.205 BSBM-BI throughput: qph*scale
BI Power:               2655.678 BSBM-BI Power: qph*scale (geom)

Metrics for Query:      1
Count:                  8 times executed in whole run
Time share              2.120884% of total execution time
AQET:                   54.299656 seconds (arithmetic mean)
AQET(geom.):            34.607302 seconds (geometric mean)
QPS:                    0.13 Queries per second
minQET/maxQET:          11.71547600s / 148.65379700s

Metrics for Query:      2
Count:                  8 times executed in whole run
Time share              0.207382% of total execution time
AQET:                   5.309462 seconds (arithmetic mean)
AQET(geom.):            2.737696 seconds (geometric mean)
QPS:                    1.34 Queries per second
minQET/maxQET:          0.78729800s / 25.80948200s

Metrics for Query:      3
Count:                  8 times executed in whole run
Time share              17.650472% of total execution time
AQET:                   451.893890 seconds (arithmetic mean)
AQET(geom.):            410.481088 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          171.07262500s / 721.72939200s

Metrics for Query:      5
Count:                  32 times executed in whole run
Time share              6.196565% of total execution time
AQET:                   39.661685 seconds (arithmetic mean)
AQET(geom.):            6.849882 seconds (geometric mean)
QPS:                    0.18 Queries per second
minQET/maxQET:          0.15696500s / 189.00906200s

Metrics for Query:      6
Count:                  8 times executed in whole run
Time share              0.119916% of total execution time
AQET:                   3.070136 seconds (arithmetic mean)
AQET(geom.):            2.056059 seconds (geometric mean)
QPS:                    2.31 Queries per second
minQET/maxQET:          0.41524400s / 7.55655300s

Metrics for Query:      7
Count:                  40 times executed in whole run
Time share              1.577963% of total execution time
AQET:                   8.079921 seconds (arithmetic mean)
AQET(geom.):            1.342079 seconds (geometric mean)
QPS:                    0.88 Queries per second
minQET/maxQET:          0.02205800s / 40.27761500s

Metrics for Query:      8
Count:                  40 times executed in whole run
Time share              72.126818% of total execution time
AQET:                   369.323481 seconds (arithmetic mean)
AQET(geom.):            114.431863 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          5.94377300s / 1824.57867400s

The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.

With these numbers, the system does not appear badly memory-bound, else the increase would be smaller; also, core multi-threading seems to bring some benefit. If the single-user run were at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparing the same numbers with the column store will be interesting, since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good.
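
The extrapolation above is plain linear scaling of throughput with CPU utilization; a minimal sketch:

```python
def cpu_scaled_throughput(throughput, cpu_pct, target_cpu_pct):
    # Assume throughput grows linearly with CPU utilization; anything
    # measured above this estimate is credited to core multi-threading.
    return throughput * target_cpu_pct / cpu_pct
```

With the figures above, `cpu_scaled_throughput(1064.552, 574, 800)` gives about 1484, in the neighborhood of the quoted 1488; the measured 1794 is what exceeds this and is attributed to core multi-threading.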

For reference, we show a run with Virtuoso 6 at 100Mt.

0: 424754.40ms, total: 424829ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    424.7544s / 424.7544s
Elapsed runtime:        424.754 seconds
QMpH:                   8.475 query mixes per hour
CQET:                   424.754 seconds average runtime of query mix
CQET (geom.):           424.754 seconds geometric mean runtime of query mix
AQET (geom.):           1.097 seconds geometric mean runtime of query
Throughput:             152.559 BSBM-BI throughput: qph*scale
BI Power:               3281.150 BSBM-BI Power: qph*scale (geom)

and 8 user

Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    616.9979s / 701.8247s
Total runtime (sum):    5237.127 seconds
Elapsed runtime:        5237.127 seconds
QMpH:                   41.031 query mixes per hour
CQET:                   654.641 seconds average runtime of query mix
CQET (geom.):           653.873 seconds geometric mean runtime of query mix
AQET (geom.):           2.557 seconds geometric mean runtime of query
Throughput:             738.557 BSBM-BI throughput: qph*scale
BI Power:               1408.133 BSBM-BI Power: qph*scale (geom)

Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput.

 100 Mt: sqrt ( 7771 * 2178 ) = 4114

1000 Mt: sqrt ( 6967 * 1794 ) = 3535
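
The composite is simply the geometric mean of the two figures:

```python
import math

def composite_metric(power, throughput):
    # Geometric mean of single-user Power and multi-user Throughput
    return math.sqrt(power * throughput)
```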

Scaling seems to work; the results are in the same general ballpark. The real times for the 1000 Mt run are a bit over 10x the times for the 100 Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable.

Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.

Also there should be more queries.

At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people.

Benchmark queries are trick questions: for example, TPC-H Q18 cannot be done without changing an IN into a JOIN, with the IN subquery in the outer loop, and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN, which besides must be partitioned at the larger scales.

Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.

In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.

BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater, and the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work; thus not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get a reasonable JOIN order, since we always reorder the query regardless of the source formulation.
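
As a toy illustration of how little is needed, a greedy left-deep ordering by fanout guesses is sketched below; this is a stand-in for illustration, not the actual Virtuoso optimizer:

```python
def order_joins(patterns, est_fanout):
    # Greedily pick the next pattern with the smallest estimated fanout.
    # Distinguishing smaller from greater is all that is required; the
    # absolute estimates do not have to be very good.
    ordered, remaining = [], list(patterns)
    while remaining:
        nxt = min(remaining, key=est_fanout)
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered
```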

BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though.

There are places in BSBM-BI where using a HASH JOIN makes sense. We do not use HASH JOINs with RDF because there is an index for everything, and a HASH JOIN in the wrong place can have a large up-front cost; one is thus more robust against cost-model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like "count the hairballs made in Germany reviewed by Japanese Hello Kitty fans," where the two ends of a JOIN path are each fairly selective, doing one end by index and the other as a HASH JOIN is good. This can, if the JOIN is always cardinality-reducing, even be merged inside an INDEX lookup. We have such capabilities, since we have been gearing up for the relational races for a while, but we are not using any of them with BSBM-BI, although they would be useful.
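
A minimal sketch of the idea, with hypothetical rows standing in for the selective ends of the JOIN path:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build a hash table on the smaller, selective side once, then probe
    # it per row, instead of doing an index lookup for every candidate.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield match, row
```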

Let us see the profile for a single user 100 Mt run.

The database activity summary is --

select db_activity (0, 'http');

161.3M rnd  210.2M seq      0 same seg   104.5M same pg  45.08M same par      0 disk      0 spec disk      0B /      0 messages  2.393K fork

See the post "What Does BSBM Explore Measure" for an explanation of the numbers. We see that there is more sequential access than random, and the random access has fair locality, with over half of the hits on the same page as the previous one and much of the rest falling under the same parent. Funnily enough, the Explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient, but it is not being used here. So we manually set the vector size to 100000 instead of the default 10000. We get --

172.4M rnd  220.8M seq      0 same seg   149.6M same pg  10.99M same par     21 disk    861 spec disk      0B /      0 messages     754 fork

The Throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default, since it raises the cost of small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to tune the self-adjust logic correctly, and we again have clear gains.
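
A minimal sketch of what such self-adjusting logic could look like; the threshold and growth policy here are made-up illustrations, not the actual implementation:

```python
def adjust_vector_size(size, same_page_hits, random_lookups,
                       target_locality=0.5, max_size=100000):
    # Grow the vector (batch) size when too few random lookups land on
    # the same page as the previous one: a larger sorted batch of keys
    # clusters better and hits the same page more often.
    if random_lookups and same_page_hits / random_lookups < target_locality:
        return min(size * 2, max_size)
    return size
```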

Let us now go back to the first run with vector size 10000.

The top of the CPU oprofile is as follows:

722309   15.4507  cmpf_iri64n_iri64n
434791    9.3005  cmpf_iri64n_iri64n_anyn_iri64n
294712    6.3041  itc_next_set
273488    5.8501  itc_vec_split_search
203970    4.3631  itc_dive_transit
199687    4.2714  itc_page_rcf_search
181614    3.8848  dc_itc_append_any
173043    3.7015  itc_bm_vec_row_check
146727    3.1386  cmpf_int64n
128224    2.7428  itc_vec_row_check
113515    2.4282  dk_alloc
97296     2.0812  page_wait_access
62523     1.3374  qst_vec_get_int64
59014     1.2623  itc_next_set_parent
53589     1.1463  sslr_qst_get
48003     1.0268  ds_add
46641     0.9977  dk_free_tree
44551     0.9530  kc_var_col
43650     0.9337  page_col_cmp_1
35297     0.7550  cmpf_iri64n_iri64n_anyn_gt_lt
34589     0.7399  dv_compare
25864     0.5532  cmpf_iri64n_anyn_iri64n_iri64n_lte
23088     0.4939  dk_free

The top 10 are all index traversal functions, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with P and S given. The one after that is a compare with all key parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where the aggregation input comes already grouped by the grouping columns.
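
Streaming aggregation can be sketched as follows, assuming input already ordered by the grouping key (a sketch of the technique, not the engine's implementation):

```python
from itertools import groupby

def streaming_sum(rows):
    # Rows arrive already ordered by the grouping key, so each group is
    # aggregated and emitted as soon as it completes, with no hash table
    # holding all groups in memory.
    for key, group in groupby(rows, key=lambda r: r[0]):
        yield key, sum(value for _, value in group)
```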

A BI use case should have no problem including arithmetic, but there are not that many numbers in the BSBM data set. Some sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good; TPC-H has such in Q21 and Q19. An OR of existence tests, where there would be gain from good guesses of a subquery's selectivity, would be appropriate. Also, there should be conditional expressions somewhere over a lot of data, like the CASE-WHEN in TPC-H Q12.

We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more.

This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.

Benchmarks, Redux Series

# PermaLink Comments [0]
03/22/2011 18:31 GMT Modified: 03/22/2011 17:04 GMT
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks [ Orri Erling ]

Let us talk about what ought to be benchmarked in the context of RDF.

A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else's problem; besides, it is a solved one.

On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.

BSBM seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is in fact a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB-to-RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is, however, a bit more difficult, if also a better-thought-out, benchmark than the BSBM BI Mix proposal. But I do not expect an RDF audience to have much enthusiasm for it, as this is by now a very tough race, and besides one in which RDB and SQL will keep some advantage. However, using it as a validation test is meaningful, as there exists a validation dataset and queries that we have already RDF-ized. We could publish these and call this "RDF-H".

In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI.

The Social Network Intelligence Benchmark (SNIB) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM. In LOD2, CWI is presently working on this.

The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. DBpedia is not very large, around 200M-300M triples, but it is diverse enough.

The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.

The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from DBpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.

There is variation in the frequency of online interaction, and this interaction consists of sessions. For example, one could analyze user behavior per time of day for online ad placement.

The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation.

There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.

Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account.

  • Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.

  • The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.

For example, some queries are logarithmic in data size (e.g., find connections to a person), some are linear (e.g., find the average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology who are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads.

The first would be an online mix with typically short lookups and updates, around O(log(n)).

The Business Intelligence Mix would be composed of queries around O(n log(n)). Even so, with real data, the choice of parameters will produce dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of "hard" and "easy" parameter choices. In the BSBM BI mix modification, I did this by defining some queries to be drill-downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values.

Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.

One characteristic of these is the production of sometimes-large intermediate results that need to be materialized. Doing these operations in practice requires procedural control. Further, running algorithms like network analytics (e.g., PageRank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language. Some basic graph operations like shortest path are not expressible in unextended SPARQL 1.1, as they would, for example, involve returning paths, which are explicitly excluded from the spec.

These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.

We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads.

We see a three-level structure emerging. There is an Online mix which is a bit like the BSBM Explore mix, and an Analytics mix which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice.

We would finally get to the "raging conflict" between the "declarativists" and the "map reductionists." Last year's VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map-reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.

We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs.

We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.

There is not, nor ought there to be, a sheltered, RDF-only enclave. RDF will have to justify itself in a world of alternatives.

This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does. RDF cannot be a total failure at this, even if this were not RDF's claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.

Benchmarks, Redux Series

03/10/2011 18:30 GMT Modified: 03/14/2011 19:36 GMT
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks [ Virtuso Data Space Bot ]

Let us talk about what ought to be benchmarked in the context of RDF.

A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else's problem; besides, it is a solved one.

On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.

BSBM seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is in fact a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB to RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is however a bit more difficult if also a better thought out benchmark than the BSBM BI Mix proposal. But I do not expect an RDF audience to have any enthusiasm for this as this is indeed a very tough race by now, and besides one in which RDB and SQL will keep some advantage. However, using this as a validation test is meaningful, as there exists a validation dataset and queries that we already have RDF-ized. We could publish these and call this "RDF-H".

In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI.

The Social Network Intelligence Benchmark (SNIB) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM. In LOD2, CWI is presently working on this.

The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. DBpedia is not very large, around 200M-300M triples, but it is diverse enough.

The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.

The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from Dbpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.

There is variation in frequency of online interaction, and this interaction consist of sessions. For example, one could analyze user behavior per time of day for online ad placement.

The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation.

There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.

Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account.

  • Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.

  • The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.

For example some queries are logarithmic to data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads.

The first would be an online mix with typically short lookups and updates, around O ( log ( n ) ).

The Business Intelligence Mix would be composed of queries around OO ( n log ( n ) ). Even so, with real data, choice of parameters will provide dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of "hard" and "easy" parameter choices. In the BSBM BI mix modification, I did this by defining some to be drill downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values.

Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.

One characteristic of these is the production of sometimes large intermediate results that need to be materialized. Doing these operations in practice requires procedural control. Further, running algorithms like network analytics (e.g., Page rank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language. Some basic graph operations like shortest path are expressible but then are not in unextended SPARQL 1.1; as these would for example involve returning paths, which are explicitly excluded from the spec.

These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.

We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads.

We see a three-level structure emerging. There is an Online mix, which is a bit like the BSBM Explore mix, and an Analytics mix, which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice.

We would finally get to the "raging conflict" between the "declarativists" and the "map reductionists." Last year's VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.

We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs.

We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.

There is not, nor ought there to be, a sheltered, RDF-only enclave. RDF will have to justify itself in a world of alternatives.

This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does. RDF cannot be a total failure at this, even if this were not RDF's claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.

Benchmarks, Redux Series

# PermaLink Comments [0]
03/10/2011 18:30 GMT Modified: 03/14/2011 19:37 GMT
Benchmarks, Redux (part 8): BSBM Explore and Update [ Orri Erling ]

We will here look at the Explore and Update scenario of BSBM. This presents us with a novel problem, as the specification does not address any aspect of ACID.

A transaction benchmark ought to have something to say about this. The SPARUL (also known as SPARQL/Update) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability.

We begin by running Virtuoso 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning SERIALIZABLE isolation between INSERTs and DELETEs, and READ COMMITTED isolation between READ and any UPDATE transaction. (Figures for Virtuoso 6 will also be presented here in the near future, as they are the currently shipping production versions.)

Virtuoso 7 Single, Full ACID
(QMpH, query mixes per hour)
Scale Single User 16 User
100 Mt 9,969 65,537
200 Mt 8,646 40,527
1000 Mt 5,512 17,293

Virtuoso 6 Cluster, Full ACID
(QMpH, query mixes per hour)
Scale Single User 16 User
100 Mt 5,604.520 34,079.019
1000 Mt 2,866.616 10,028.325

Virtuoso 6 Single, Full ACID
(QMpH, query mixes per hour)
Scale Single User 16 User
100 Mt 7,152 21,065
200 Mt 5,862 16,895
1000 Mt 1,542 4,548

Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm cache; see the previous post on read-ahead for details. All runs do 1000 Explore and Update mixes. The initial database is in the state following the Explore-only runs.

The results are in line with the Explore results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability.

In the following we will look at transactions and at how the definition of the workload and reporting could be made complete.

Full ACID means serializable semantics for concurrent insert and delete of the same quad. Non-transactional means that on concurrent insert and delete of overlapping sets of quads, the result is undefined. Further, if one logged such "transactions," the replay would be serialized even though the initial execution was not, further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail: if the quad already exists, the insert does nothing; if it does not, it is created. The same applies to deletes alone. If a delete and an insert overlap, serialization would be needed, but the semantics implicit in the use case make this improbable.
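The argument can be illustrated by modeling a quad store as a plain set; this is a toy model for the semantics, not Virtuoso's implementation:

```python
# Model a quad store as a set: inserting an existing quad is a no-op, as is
# deleting a missing one, so insert-only (or delete-only) transactions
# commute and need no serialization. Only an overlapping insert and delete
# of the same quad is order-dependent.
def apply_txn(store, op, quads):
    if op == "insert":
        return store | quads
    return store - quads

store = frozenset({(1, 5, 6, 0)})
ins = frozenset({(1, 2, 3, 0)})
dele = frozenset({(1, 2, 3, 0)})  # overlaps the insert

# Repeating an insert changes nothing: inserts are idempotent.
assert apply_txn(apply_txn(store, "insert", ins), "insert", ins) == \
       apply_txn(store, "insert", ins)

# Overlapping insert and delete: the end state depends on the order.
a = apply_txn(apply_txn(store, "insert", ins), "delete", dele)
b = apply_txn(apply_txn(store, "delete", dele), "insert", ins)
```

Here `a` lacks the quad and `b` contains it, which is exactly the case where serialization would matter, and which the use case makes improbable.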

Read-only transactions (i.e., the Explore mix in the Explore and Update scenario) will be run as READ COMMITTED. These do not see uncommitted data and never block for lock wait. The reads may not be repeatable.

Our first port of call is to determine the cost of ACID. We run 1000 mixes of Explore and Update at 1000 Mt. The throughput is 19,214 QMpH after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale.

We look at the pertinent statistics:

SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC;
KEY_TABLE         INDEX_NAME       LOCKS   WAITS   WAIT_PCT   DEADLOCKS   LOCK_ESC   WAIT_MSECS
===============   =============   ======   =====   ========   =========   ========   ==========
DB.DBA.RDF_QUAD   RDF_QUAD_POGS   179205     934          0           0          0        35164
DB.DBA.RDF_IRI    RDF_IRI          20752     217          1           0          0        16445
DB.DBA.RDF_QUAD   RDF_QUAD_SP       9244       3          0           0          0          235

We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run was 187 seconds, real time. The lock wait time is not real time since this is the total elapsed wait time summed over all threads. The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions.
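The derived figures follow from simple arithmetic on the run statistics; the sketch below recomputes them (the computed QMpH is close to, though not identical with, the separately reported throughput, as one would expect given timing granularity):

```python
mixes = 1000
run_seconds = 187.0                # wall-clock time of the run
waits = [934, 217, 3]              # per-index WAITS from sys_l_stat
pogs_waits, pogs_wait_ms = 934, 35164

qmph = mixes * 3600 / run_seconds  # query mixes per hour, ~19,251
waits_per_mix = sum(waits) / mixes # ~1.15: a little over one per mix
avg_pogs_wait_ms = pogs_wait_ms / pogs_waits  # ~37.6 ms per POGS wait
```

Note that the ~35 s of wait time is summed over all threads, so it is compatible with a 187 s wall-clock run.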

We note that we do not get deadlocks since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row level locking also for this index. This is to be seen. The column store would otherwise tend to have higher cost per random insert.
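The deadlock-avoidance argument can be sketched as follows; the batch and the key encoding are illustrative:

```python
def vectored_insert_keys(triples):
    # Vectored execution sorts the batch so locks are taken in ascending
    # key order. With such a global lock order, circular wait (and hence
    # deadlock) is impossible for single-batch transactions.
    return sorted(triples)

def has_lock_order_violation(acquisition_order):
    # A transaction that ever requests a key lower than one it already
    # holds could participate in a deadlock cycle.
    return any(b < a for a, b in zip(acquisition_order,
                                     acquisition_order[1:]))

batch = [(1, 5, 7), (1, 2, 3), (1, 5, 6)]
order = vectored_insert_keys(batch)
```

The unsorted batch violates the ordering discipline; the vectored (sorted) one cannot, which is why deadlocks disappear as long as a transaction fits in one vector.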

Considering these results it does not seem crucial to "drop ACID," though doing so would save some time. We will now run measurements for all scales with 16 Users and ACID.

Let us now see what the benchmark writes:

SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC;
KEY_TABLE                     INDEX_NAME                       TOUCHES     READS   READ_PCT   N_DIRTY   N_BUFFERS
===========================   ============================   =========   =======   ========   =======   =========
DB.DBA.RDF_QUAD               RDF_QUAD_POGS                  763846891    237436          0     58040      228606
DB.DBA.RDF_QUAD               RDF_QUAD                       213282706   1991836          0     30226     1940280
DB.DBA.RDF_OBJ                RO_VAL                             15474     17837        115     13438       17431
DB.DBA.RO_START               RO_START                           10573     11195        105     10228       11227
DB.DBA.RDF_IRI                RDF_IRI                            61902    125711        203      7705      121300
DB.DBA.RDF_OBJ                RDF_OBJ                         23809053   3205963         13       636     3072517
DB.DBA.RDF_IRI                DB_DBA_RDF_IRI_UNQC_RI_ID        3237687    504486         15       340      488797
DB.DBA.RDF_QUAD               RDF_QUAD_SP                        89995     70446         78        99       68340
DB.DBA.RDF_QUAD               RDF_QUAD_OP                        19440     47541        244        66       45583
DB.DBA.VTLOG_DB_DBA_RDF_OBJ   VTLOG_DB_DBA_RDF_OBJ                3014         1          0        11          11
DB.DBA.RDF_QUAD               RDF_QUAD_GS                         1261       801         63        10         751
DB.DBA.RDF_PREFIX             RDF_PREFIX                            14       168       1120         1         153
DB.DBA.RDF_PREFIX             DB_DBA_RDF_PREFIX_UNQC_RP_ID        1807       200         11         1         200

The most dirty pages are on the POGS index, which is reasonable, as values are spread out at random. After this we have the PSOG index, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, led by the index that maps the leading string (or hash) of the literal to its id, as one would expect, again because values are distributed at random. After this come IRIs. The distribution of updates is generally as one would expect.

* * *

Going back to BSBM, at least the following aspects of the benchmark have to be further specified:

  • Disclosure of ACID properties. If the benchmark required full ACID, many would not run it at all. Besides, full ACID is not necessarily an absolute requirement, given the hypothetical usage scenario of the benchmark. However, when publishing numbers, the guarantees that go with them must be made explicit. This includes logging, checkpoint frequency or equivalent, etc.

  • Steady state. The working set of the Update mix is different from that of the Explore mixes, and Update touches more indices than Explore. The Explore warm-up is partly adequate but does not represent steady state.

  • Checkpoint and sustained throughput. Benchmarks involving updates generally have rules for checkpointing the state and for sustained throughput. Specifically, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed over a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no fewer than 3 completed checkpoints in the interval and at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used, it needs to be described.

  • Memory and warm-up issues. We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also, the test driver should allow running updates in timed and non-timed (warm-up) mode.
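The checkpoint-interval rule above can be written as a small validity check; representing checkpoints as (start, end) minute pairs is a simplification of my own:

```python
def valid_test_interval(duration_minutes, checkpoints):
    # checkpoints: list of (start_minute, end_minute) pairs within the
    # test interval, assumed sorted by start time.
    if duration_minutes < 30 or len(checkpoints) < 3:
        return False
    # At least 9 minutes between the end of one checkpoint and the
    # start of the next.
    for (_, end), (start, _) in zip(checkpoints, checkpoints[1:]):
        if start - end < 9:
            return False
    return True

ok = valid_test_interval(35, [(0, 1), (12, 13), (24, 25)])
bad = valid_test_interval(35, [(0, 1), (5, 6), (24, 25)])
```

A test driver could apply such a check to the run log before accepting a result as compliant.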

With an update benchmark, many more things need to be defined, and the set-up becomes more system-specific, than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user.

Benchmarks, Redux Series

# PermaLink Comments [0]
03/09/2011 12:32 GMT Modified: 03/15/2011 17:18 GMT