Orri Erling

Transaction Semantics in RDF and Relational Models

As part of defining a benchmark audit for testing ACID properties in RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and their implications for locking (pessimistic) and multi-version (optimistic) concurrency control schemes.

In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.

Insert (Create) and Delete

INSERT and DELETE as defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the

DELETE FROM table WHERE key IN 
   ( SELECT key1 FROM other_table )

expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.

The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.

Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
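
To make the two operations concrete, here is a minimal SPARQL 1.1 Update sketch of the same scenario. The ex: prefix, the resource names, and the use of plain integers as objects are illustrative assumptions standing in for the numeric IDs above; the two requests come from different transactions.

# txn1: delete every triple with subject 1
PREFIX ex: <http://example.org/>
DELETE WHERE { ex:s1 ?p ?o }

# txn2: insert the three new triples with subject 1
PREFIX ex: <http://example.org/>
INSERT DATA { ex:s1 ex:p2 4 . ex:s1 ex:p2 3 . ex:s1 ex:p3 5 }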

Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.

We would not get the desired end state, either all the inserted triples or no triples with subject 1, if the read done by the DELETE were not SERIALIZABLE.

Furthermore if a DELETE template produced a triple that did not exist in the pre-image, the DELETE semantics still imply that this also does not exist in the after-image, which implies serializability.

Read and Update

Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.

The initial state is

a  balance  10
b  balance  10

We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.

A relational database needs REPEATABLE READ isolation for this.

With RDF, txn1 reads that a has a balance of 10. At the same time, txn2 attempts to read the balance of a and waits, because the read of txn1 is exclusive. txn1 proceeds and reads the balance of b. It then updates the balances of a and b.

All this goes through without the deadlock that is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.
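
As a sketch, txn1's transfer could be written as the following SPARQL 1.1 Update (the prefix and property names are hypothetical). The point is that each balance change is a DELETE of the old triple plus an INSERT of the new one, not an update-in-place.

PREFIX ex: <http://example.org/>
DELETE { ex:a ex:balance ?ab . ex:b ex:balance ?bb }
INSERT { ex:a ex:balance ?ab2 . ex:b ex:balance ?bb2 }
WHERE  {
  ex:a ex:balance ?ab .
  ex:b ex:balance ?bb .
  BIND (?ab - 1 AS ?ab2)   # txn1 moves 1 from a to b
  BIND (?bb + 1 AS ?bb2)
}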

We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in an RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.

Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.

Locking Order and OLTP

To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood of contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but they should become fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
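
A minimal sketch of the deterministic-order idea in generic SQL, assuming an engine where SELECT ... FOR UPDATE acquires row locks in the order rows are produced; the table and column names are hypothetical.

-- lock the rows to be updated in ascending key order, so that all
-- transactions touching the same accounts acquire their locks in the same order
SELECT balance
  FROM account
 WHERE account_id IN (1001, 1007, 1042)
 ORDER BY account_id
   FOR UPDATE;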

This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.

RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system, and applying row-level locking to this does not give the semantics one expects of a relational row.

I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.

It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.

Pessimistic Vs. Optimistic Concurrency Control

We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.

To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
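
In generic SQL terms, the guarantee looks like the following sketch (the quad table and column names are hypothetical). Under SERIALIZABLE, the second count must equal the first, so a concurrent INSERT of a matching row either waits or leads to a transaction abort.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT COUNT(*) FROM quad WHERE s = 1;   -- first count
-- a concurrent INSERT of a row with s = 1 now blocks or aborts
SELECT COUNT(*) FROM quad WHERE s = 1;   -- returns the same number
COMMIT;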

Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.

The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.

Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.

In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.

Eventual Consistency

Web scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.

Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.

Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed quads (DELETEd plus INSERTed, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.

As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.

Conclusions

We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.

Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.

# PermaLink Comments [1]
03/22/2011 19:55 GMT Modified: 08/20/2015 17:31 GMT
RDF and Transactions

I will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists.

Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?

There are in fact users who even use XA with a Virtuoso-based RDF application. Franz has also publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.

At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.

If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.

There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).

If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.

There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.

Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
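
As a sketch of the reservation case in SPARQL (IRIs and property names are hypothetical), the check and the conditional update below must run inside one transaction on a connected client, since the SPARQL protocol itself has no transaction demarcation:

PREFIX ex: <http://example.org/>   # assumed for both statements

# check availability
ASK { ex:seat12A ex:status ex:free }

# reserve only if still free; this matches zero triples if someone got there first
DELETE { ex:seat12A ex:status ex:free }
INSERT { ex:seat12A ex:status ex:reserved }
WHERE  { ex:seat12A ex:status ex:free }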

Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.

Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.

As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.

As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.

This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.

As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.

Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.

With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.

If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF loads in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, as long as each is under a certain size (normally 10,000 quads), is guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there was always a possibility of having an INSERT or DELETE fail because of deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock-free, since there the primary transaction is the INSERT or DELETE of a small graph.

Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.

In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.

# PermaLink Comments [0]
03/22/2011 18:52 GMT Modified: 03/22/2011 17:44 GMT
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire

Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning.

  1. Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e.g., web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.

    The following three settings are all in the [Parameters] section of the virtuoso.ini file.

    • AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better.

    • ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better.

    • IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.

      A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.

    In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.

    Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count.
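
    For illustration only, a hypothetical virtuoso.ini fragment for a 16-core machine might look as follows; the numbers are assumptions to be tuned as described above.

    [Parameters]
    ServerThreads        = 100   ; total pool; at least as large as [HTTPServer] ServerThreads
    AsyncQueueMaxThreads = 24    ; ~1.5 * number of cores
    ThreadsPerQuery      = 16    ; = number of cores
    IndexTreeMaps        = 256   ; power of 2 between 2 and 1024

    [HTTPServer]
    ServerThreads        = 50    ; max concurrent SPARQL protocol requests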

  2. File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.

    It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.

    In the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files.

    There are no SSD specific settings.

  3. Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?

    Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.

    Use the built-in bulk load facility, i.e.,

    ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');

    For example,

    SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');

    Then do a rdf_loader_run () on enough connections. For example, you can use the shell command

    isql rdf_loader_run () &

    to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details).

    See the BSBM disclosure report for an example load script.
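
    For instance, a minimal load script along these lines starts one loader per core on a hypothetical 8-core box, waits for all of them, and then checkpoints. It assumes isql's exec= command-line option and placeholder connection details (port 1111, dba/dba).

    for i in 1 2 3 4 5 6 7 8
    do
        isql 1111 dba dba exec="rdf_loader_run();" &
    done
    wait                                    # shell built-in; block until all loaders finish
    isql 1111 dba dba exec="checkpoint;"    # make the loaded state durable (see #4 below)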

  4. What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.

    Execute

    CHECKPOINT;

    through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol.

  5. What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.

    No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

    DefaultIsolation = 4

    Transaction isolation cannot be set over the SPARQL protocol.

    NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.

  6. What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.

    In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.
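
    As a worked example (an assumption, not a recommendation): on a 32 GB machine, two-thirds of RAM is roughly 21 GB, which at 10KB per buffer comes to about 2,100,000 buffers.

    [Parameters]
    NumberOfBuffers = 2100000    ; ~21 GB of database cache at ~10KB per buffer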

  7. What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?

    In an isql session, execute

    STATUS ( ? ? );

    The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.

  8. What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.

    Execute on an isql session:

    CHECKPOINT;
    SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
    

    The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported, gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for text index.

  9. If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph.

    The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute

    STATISTICS DB.DBA.RDF_QUAD;

  10. For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index?

    The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order.

  11. For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?

    In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of 10000 to 50000 or so if this is seen to be useful.

    To change this on the fly, the following can be issued through an isql session:

    cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );

    The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.

    STATUS ('cluster')      ;; whole cluster
    STATUS ('cluster_d')    ;; process-by-process

  12. Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?

    • For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with

      StopCompilerWhenXOverRun = 1
    • When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with

      DefaultIsolation = 2

Benchmarks, Redux Series

# PermaLink Comments [0]
03/04/2011 15:28 GMT Modified: 08/21/2014 10:46 GMT
Virtuoso Directions for 2011

At the start of 2010, I wrote that 2010 would be the year when RDF became performance- and cost-competitive with relational technology for data warehousing and analytics. More specifically, RDF would shine where data was heterogeneous and/or where there was a high frequency of schema change.

I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.

At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogeneous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.

So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso's relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented HASH JOIN and GROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.

At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.

A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time typing turns out to be small, since data in practice tends to be of homogeneous type and with locality of reference in values. Virtuoso's column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogeneous types.

The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side.

The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.

LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the "RDF tax," by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.

So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.

For now, our priority is to release the substantial gains that have already been accomplished.

After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper.

Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.

The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.

Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed.

The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?

Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market.

These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start.

Stay tuned for more news later this month!


# PermaLink Comments [0]
01/19/2011 11:29 GMT Modified: 01/20/2011 12:54 GMT
VLDB Semdata Workshop

I will begin by extending my thanks to the organizers, specifically Reto Krummenacher of STI and Atanas Kiryakov of Ontotext, for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontiffs (pontifex) amongst us, who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities, to the advantage of all concerned.

Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.

I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.

One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?

The answer is that if one accesses a lot of these literals, the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there do in fact exist reports in bioinformatics with 100,000 strings. Now take the worst abuse of SPARQL: a regexp over all literals of a property of a given class. With a column store, this is a scan of the column; with RDF, a three-table join. The join is about 10x slower than the column scan. Quite OK, considering that a full-text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
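
The pattern in question looks roughly like this in SPARQL; the class and property names are hypothetical.

PREFIX ex: <http://example.org/>
SELECT ?s ?label
WHERE
{
  ?s a ex:Product ;
     ex:label ?label .
  FILTER regex(?label, "acme", "i")
}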

Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander, and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing, and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to an RDB are not so large, even though they do require changes to source code.

Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogeneous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.

Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.

With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.

Of the academic papers, SpiderStore (whose paper is not yet available at the time of writing, but should be soon) and Webpie should be specially noted.

Let us talk about SpiderStore first.

SpiderStore

The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.

According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.
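
As a back-of-the-envelope check of these figures (the ratio of distinct IRIs to triples below is my assumption, chosen so as to reproduce the authors' number; the actual ratio in the paper may differ):

    \text{pointers per triple} \;=\; 3 + 5 \cdot \frac{n_{\mathrm{IRI}}}{n_{\mathrm{triple}}}
    \;\approx\; 3 + 5 \times 0.2 \;=\; 4,
    \qquad
    4 \times 8\,\text{bytes} \;=\; 32\,\text{bytes per triple}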

This is not particularly memory efficient: one must also count the unused space left after growing the pointer lists, fragmentation, and so forth, which will push the consumption closer to 40 bytes per triple. Should one add a graph to the mix, one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as each distinct value has a chunk of memory to it with a type tag.

We get a few times better memory efficiency with column-compressed quads, and we are not limited to main memory.

But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard-coded graph traversal library, and would not be very difficult. Supporting multithreaded updates would spoil much of the gain, but allowing single-threaded updates and forking read-only copies for reading would be fine.

SpiderStore as such is not attractive for what we intend to do, namely aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since it executes depth-first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize it by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is a massively multithreaded platform, which favors code with few instructions and tolerates poor memory locality. SpiderStore could be a good fit for massive multithreading, especially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.

We too could adopt some ideas from SpiderStore. Namely, when running vectored, one generates in passing, without extra overhead, an array of links to the next IRI, a bit like the array SpiderStore keeps per predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.

If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.

Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.

Webpie

Webpie, from VU Amsterdam and the EU FP7 LarKC project, is, as it were, the opposite of SpiderStore: a map-reduce-based RDFS and OWL Horst inference engine, all about breadth-first passes over the data with intermediate disk-based storage.

Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.

The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.
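
As a minimal sketch of what one such fixed-point step could look like in SPARQL 1.1 Update syntax (my illustration of the idea, not Webpie's actual rule set), here is the RDFS subclass-transitivity rule, rerun until the store reports that no new triples were inserted:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # One RDFS rule (subClassOf transitivity); rerun until no new triples result
    INSERT { ?a rdfs:subClassOf ?c }
    WHERE {
      ?a rdfs:subClassOf ?b .
      ?b rdfs:subClassOf ?c .
      FILTER NOT EXISTS { ?a rdfs:subClassOf ?c }   # only genuinely new facts
    }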

We have suggested such an experiment to the LarKC people. We will see.

09/21/2010 17:14 GMT Modified: 09/21/2010 16:22 GMT
Suggested Extensions to the BSBM

Below is a list of possible extensions to the Berlin SPARQL Benchmark. Our previously stated critique of BSBM consists of the following points:

  1. The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF.

  2. Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.

  3. An update stream would make the workload more realistic.

We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics.

So I am publishing the below as a starting point for discussion.

BSBM Analytics Mix

Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation time is between linear and n log(n) in the data size. The TPC-H rules can be used for a power metric (single user) and a throughput metric (multiple users, each submitting queries from the mix with different parameters and in a different order). The TPC-H score formula and executive summary format are directly applicable.
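
For reference, the TPC-H composite score works roughly as follows (sketched from the TPC-H specification as I recall it; the query count of 22 and refresh count of 2 would be replaced by the counts of this mix; Q_i and R_j are timed query and refresh runs, SF the scale factor, S the number of throughput streams, and T_s the length of the throughput test in seconds):

    \mathrm{Power@Size} = \frac{3600 \cdot SF}{\Bigl(\prod_{i=1}^{22} Q_i \cdot \prod_{j=1}^{2} R_j\Bigr)^{1/24}},
    \qquad
    \mathrm{Throughput@Size} = \frac{S \cdot 22 \cdot 3600}{T_s} \cdot SF,
    \qquad
    \mathrm{QphH@Size} = \sqrt{\mathrm{Power@Size} \cdot \mathrm{Throughput@Size}}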

This can be a separate metric from the "restricted" BSBM score. Restricted means "without a full scan with regexp" which will dominate the whole metric at larger scales.

Vendor-specific variations in syntax will occur; these are allowed, but the specific query text used must be disclosed with the results. Hints for JOIN order and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible.

The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after the backup that follows the initial load is allowed, resulting in a semi-warm cache. Steady-state rules will be defined later, once the characteristics of the actual workload are known.

  1. For each country, list the top 10 product categories, ordered by the count of reviews from the country. (A SPARQL sketch of this query is given after the list.)

  2. Product with the most reviews during its first month on the market

  3. 10 products most similar to X, with similarity score based on the count of features in common

  4. Top 10 reviewers of category X

  5. Product with largest increase in reviews in month X compared to month X-minus-1.

  6. Product of category X with largest change in mean price in the last month

  7. Most active American reviewer of Japanese cameras last year

  8. Correlation of price and average review

  9. Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature

  10. Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews

  11. Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers

  12. Fans of manufacturer — find top reviewers who score manufacturer above their mean score

  13. Products sold only in country X
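
As an indication of the intended complexity, below is a sketch of query 1 in SPARQL 1.1. The property names only approximate the BSBM vocabulary and should be read as placeholders, and the final top-10-per-country cut is left to a subquery or post-processing:

    PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
    PREFIX rev:  <http://purl.org/stuff/rev#>

    SELECT ?country ?category (COUNT(*) AS ?reviewCount)
    WHERE {
      ?review   bsbm:reviewFor  ?product .   # review of a product
      ?review   rev:reviewer    ?person .    # written by a person
      ?person   bsbm:country    ?country .   # from a country
      ?product  a               ?category .  # product category membership
    }
    GROUP BY ?country ?category
    ORDER BY ?country DESC(?reviewCount)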

BSBM IR

Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.

For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.

  • Q6 from the original mix, now allowing use of the text index. (A sketch in one vendor's syntax follows the list.)

  • Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.

  • The same as the previous, but now also specifying the review author. The intent is that the structured criteria are here more selective than the text.

  • Difference in the frequency of use of "awesome", "super", and "suck(s)" by American vs. European vs. Asian review authors.
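
As one example of the vendor-specific syntax alluded to above, the Q6 variant could use Virtuoso's bif:contains text-match predicate roughly as follows; other stores will have their own predicates, and this is an illustration rather than normative query text:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?product ?label
    WHERE {
      ?product rdfs:label ?label .
      ?label   bif:contains "gigabit" .   # text-index lookup instead of a regexp scan
    }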

Changes to Test Driver

For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this.

The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.

The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.

Changes to Data Generation

For supporting the IR mix, reviews should, in addition to random text, contain the following:

  • For each feature in the product concerned, add the label of said feature to 60% of the reviews.

  • Add the names of review author, product, product category, and manufacturer.

  • The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list; in 80% of its uses it correlates with the score, and in 20% it is random. For 90% of the adjectives, pick from lists of idiomatic expressions corresponding to the country of the reviewer; in the remaining 10% of cases, use a random list of idioms.

  • Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.

Update Stream

During the benchmark run:

  • 1% of products are added;

  • 3% of initial offers are deleted and 3% are added; and

  • 5% of reviews are added.

Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.

The initial bulk load does not have to be transactional in any way.

Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in READ COMMITTED isolation, so that half-inserted products or offers are not seen.

Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.

The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.
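
A hypothetical fragment of such an update file might look as follows; the boundary marker text and the property names are entirely illustrative, to be fixed when the generator is written:

    @prefix ex: <http://example.org/bsbm-update#> .

    # ---- possible transaction boundary ----
    ex:Product98765  a            ex:Product ;
                     ex:label     "New product 98765" ;
                     ex:producer  ex:Producer42 .
    # ---- possible transaction boundary ----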

The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.

The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.

09/21/2010 17:13 GMT Modified: 09/21/2010 16:22 GMT
LOD2 Kick Off

The LOD2 kick off meeting was held in Leipzig on Sept 6-8. I will here talk about OpenLink plans as concerns LOD2; hence this is not to be taken as representative of the whole project. I will first discuss the immediate and conclude with the long term.

As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and RDF benchmarks in February.

The LOD2 repository is a fusion of the OpenLink LOD Cloud Cache (which includes data from URIBurner and PingTheSemanticWeb) and Sindice, both hosted at DERI. The value-add compared to Sindice or the Virtuoso-based LOD Cloud Cache alone is the merger of the timeliness and ping-based crawling of Sindice with the SPARQL of Virtuoso.

Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to a much better working set, as the data is many times more compact than with the present row-wise key compression.

Still further, but before next September, we will have dynamic repartitioning; the time of availability is fixed because this is part of the LOD2 project roadmap. The operational need for it is pushed back somewhat by the compression gains from column-wise storage.

As for benchmarks, I just compiled a draft of suggested extensions to the BSBM (Berlin SPARQL Benchmark). I talked about this with Peter Boncz and Chris Bizer, to the effect that some extensions of BSBM could be done but that the time was a bit short for making an RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational schema and that RDF offers no fundamental edge for the workload.

There was a graph benchmark talk at the TPC workshop at VLDB 2010. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set, but LOD2 will in time publish results from such a benchmark.

At VLDB 2010, we did informally talk with our colleagues from Franz and Ontotext about a process for publishing results. The idea is that vendors tune their own systems and do the runs, and that the others check on this, preferably all using the same hardware.

Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-H in SQL and SPARQL. The SQL will be Virtuoso, MonetDB, and possibly VectorWise and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.

In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.

LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at CWI to RDF use cases.

This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion.

LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all and end-all of this. Of course, a parallel database like Virtuoso, Greenplum, or Vertica can do map-reduce-style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the Berkeley Orders Of Magnitude (BOOM) distributed Datalog implementation (Overlog, Dedalus, BLOOM) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.

From our viewpoint, the project's gains include:

  • Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.

  • Ready to use toolbox for data integration, including schema alignment and resolution of coreference.

  • Data discovery, summarization, and visualization.

Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and linked data. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.

The use cases fit the development profile well:

  • Wolters Kluwer will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.

  • Exalead will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.

  • The Open Knowledge Foundation will create a portal of all government published data for easy access by citizens.

In all these cases, the integration requirements of schema alignment, resolution of identity, information extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.

09/21/2010 17:13 GMT Modified: 09/21/2010 16:22 GMT
"The Acquired, The Innate, and the Semantic" or "Teaching Sem Tech"

I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.

I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to my core competence, which is hardcore tech, and leave management to those who have time for it.

When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.

In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples; for example, my wife's grandmother decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.

To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow.

Now, in more specific terms, what can we realistically expect to teach about computer science?

Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.

Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.

Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.

I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.

Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.

LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.

Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.

I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:

The edge is created in the "Wild West" — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."

So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.

But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.

  • Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.

  • Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.

    The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.

  • Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.

Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.

So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?

  • Data integration. Given heterogeneous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.

  • Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.

  • Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.

  • Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.

  • Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.

Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.

Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.

The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.

We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.

If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.

The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.

04/05/2010 11:21 GMT Modified: 05/05/2010 13:49 GMT
"The Acquired, The Innate, and the Semantic" or "Teaching Sem Tech"

I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.

I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it.

When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.

In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples, for example my wife's grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.

To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow.

Now, in more specific terms, what can we realistically expect to teach about computer science?

Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.

Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.

Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.

I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.

Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.

LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.

Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.

I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:

The edge is created in the "Wild West" — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."

So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.

But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.

  • Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.

  • Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.

    The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.

  • Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.

Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.

So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?

  • Data integration. Given heterogenous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.

  • Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.

  • Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.

  • Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.

  • Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.

Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.

Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.

The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.

We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.

If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.

The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.

Related

# PermaLink Comments [0]
04/05/2010 11:21 GMT Modified: 05/05/2010 13:49 GMT