Details
OpenLink Software
Burlington, United States
Subscribe
Post Categories
Recent Articles
Community Member Blogs
Display Settings
Translate
|
Showing posts in all categories Refresh
LOD2 Plenary and Review: Semanticist, Think Database!
[
Virtuso Data Space Bot
]
Last week the LOD2 FP7 project had its first review, preceded by its third plenary meeting.
Before this, we did, as promised, get the column store and vectored execution capabilities of Virtuoso 7 Single-Server Edition extended to Virtuoso 7 Cluster Edition. More interesting still, we decoupled storage from the database server process, so now database files can migrate between server processes. This means that clusters are now elastic, i.e., new servers can be added to a cluster and the load can be redistributed without reloading the data.
These things were long planned, but now are done. Measurements will be published in some weeks, as part of CWI's continued running of RDF store benchmarks, per the LOD2 plan.
Doing the column store and elastic cluster is work enough, so I do not in general participate in support or consultancy or the like. This has some pros and cons. On the plus side, there is a relative lack of noise and a very clear idea of focus. Of course, this work is most highly applied, thus always informed by use cases, thus forgetting what ought to be done out there is not the problem. Rather, the problem is forgetting how things in fact are done as opposed to how they could or should be done.
To cut a long story short, it has become clear to me that the DBMS must tell the application developer what to do. Of course, the application developer could also look at performance metrics, but they do not, and explaining these metrics is too much work and yields no lasting benefit. Developers will produce all kinds of performance diagnostic traces if requested, but going through this song and dance can also be avoided by the right automation.
So, I will introduce two new product features called Wazzup? and Saywhat?
Wazzup? is answered by a mood line, like "Heavily disk bound: 100G more memory will give 10x speedup" or "Network bound: Processing in larger batches will give 5x more throughput" and Saywhat? is answered by some commentary on the user's last action, for example "there is no ?order with o_totalprice < 0" or "there is no property O_misspelledtotallprrice."
Wazzup? is about overall system state, and Saywhat? is about the user session, specifically query plans. But an explanation of a query plan is not understandable, so this will just point out some salient facts, like the reason why the answer comes out empty.
The other thing that came to my attention is the fact that a user has no instinctive feel for ETL. A database person takes it for a self-evident truth that data is loaded in bulk, but the application developer does not think of that. Likewise, the line between warehousing and federating is not instinctively felt; actually the question is not even posed in these terms. So one will find Web protocols and end-points and glue code on the app server when one ought to have ETL and adequate hardware for running the consolidated database.
Further, under-provisioning of equipment is endemic with semanticists. The Semantic Web gets a needlessly bad rap just because we find too much data on too little equipment. For example, I was surprised to learn that the Linked Geodata demo ran on only 16 GB RAM and 6 processor cores with 2 billion triples and 350 million points in a geo index. Now, even with our greatest space efficiency advances, there is no way this will run from memory.
It is not that the Web 2.0 stack is necessarily efficient (we hear the wildest stories of lack of database understanding from that side too), but at least there is a culture of running with enough equipment. Surely when the web-scale data gear (e.g. Google Bigtable, Yahoo PNUTS, Amazon Dynamo) was new, by the operators' own admission there was no way for this to be particularly efficient, database-wise. Not if your eventual consistency is a client application to a shared MySQL back-end. For a lookup or single-record-update workload, who cares when there is enough hardware? For analytics, there is the de facto impossibility of doing big joins, but map reduce is for that, all offline. The big web houses have always known how to deal with data; it is the smaller Web 2.0 guys who patch systems together with duct tape and memcache. Even so, the online experience gets created.
Semanticism has no part of this outlook, except maybe for Freebase, but then they are from California and now have been inside Google for a while.
We quite understand that when one needs to get big data online, one makes a key-value store as a point solution, because this way one owns what one operates, and the time to market is a lot shorter than if one tried building all this inside a general-purpose DBMS. Besides, the people who can in fact do this almost do not exist, and even if one had a whole army of this rare breed, development is not very scalable in a tightly-integrated system like a high-performance DBMS. Still further, to even start, one needs to own the DBMS, meaning that the initial platform must be known through and through. This is an issue even though open source platforms exist.
The graph data, semdata, schema-last, RDF, linked data enterprise -- whatever one calls it -- makes the bold proposition of bringing complex-query-at-scale to heterogeneous data. This is a database claim.
In the meantime, test deployments are made in defiance of database best practices. This is a bit like test driving a race car in reverse gear and steering by looking in the rear-view mirror.
There is also no short-term scalable way to educate people. At the LOD2 review, one comment was that an integrated project ought to clearly indicate how to set up the tool chain for good performance, specially as concerns interfaces between the tools. This is very true. Experience shows that developers of tools cannot accurately anticipate what usage patterns will emerge in the field. Therefore, we propose to do better than just documentation; we will make the server recognize the common sources of inefficiency and point the user to the right action.
Provisioning and usage patterns: The DBMS ought to know best.
Imagine the following conversation:
DBMS: Your application does single-triple INSERTs over client-server protocol all day, from a single client. 57% of real time goes in client server latency, 40% in cluster interconnect latency, 2% in compiling the statements, and 1% in doing the work. Use array parameters or bulk load from a file.
Operator: My developers use industry-standard Java class libraries with a service-oriented architecture and strictly enforced interfaces. This is called software engineering. Watch out ere you raise your voice against the canon.
[Some weeks later, after the load job has gone on for 10 days and gotten a third of the way, developers have discovered that JDBC has array parameters and are trying these.]
DBMS: 60% of real time goes into waiting for locks. 10% of transactions get aborted for deadlock. Transactions consist of an average of 10 client-server operations. Use stored procedures; acquire locks in predictable order; do SELECT FOR UPDATE. Throughput will be 4x higher if client-server operations are merged into a single operation. The transactions only INSERT; hence consider bulk load instead.
Operator: We are using an enterprise-class three-tier architecture. It has "enterprise" in the name and all the big guys are using it, so it must be scalable. Besides, it is distributed transactions, and distributed computing is the wave of the future. You are a cluster yourself, so the pot's got no business calling the kettle black.
[After a while, the data gets loaded with bulk load, but now on a single stream.]
DBMS: CPU is at 400% for an INSERT workload; adding more parallel threads will get 4.5x better throughput.
[Some time has elapsed and there are Ajax client apps out there trying to use the data.]
DBMS: Will you really not give me another 140 GB RAM and 16 more cores?
Operator: No, on general principles I will not, shut up.
DBMS: Do you know that your page impression takes 3 seconds and anything over 0.25 seconds is visibly slow? 300 GB worth of distinct pages have been accessed in the last 24 hours for 160 GB of RAM. Latency will drop 10x by using SSD; 50x by increasing RAM.
Operator: No dice, bucket. Shut up, besides, when I scroll through the data I always use for testing, I get it fast enough, you are just doing this out of greed and self-importance. You are a server among many, just like the mail server; you databases are just pretentious.
Currently addressing any of the above sorts of issues takes a long time and involves mostly-avoidable support communication. Questions of this sort do occur. We can probably produce commentary like the above based on logging some 50 numbers, and making some 15 regularly-run reports over these. The patterns to watch out for are well known. No, we will not make a Zippy the Pinhead office assistant; a computer should not try to be cute. This one will talk only in terms of gains from adjusting the deployment or usage patterns.
Now, suppose the operator said yes to the request for more cores and memory; then it would be up to the DBMS to deliver. This entails a capacity to redistribute itself automatically, and to give a quantitative report on the success of this measure. This means usage-based repartitioning of the data to equalize load over a cluster. The relevant metric in the above case is the drop in response time. On the other hand, the DBMS should also notice if there is clearly unused capacity.
This all will be presented as a line in the status report, so there is no extra wizard or workload analyzer that one must remember to run. For programmatic use there are SQL views for the relevant reports.
As for ETL, even if the DBMS can detect that it is not being done right, this does not mean that the user will know what to do. Therefore, for all the Web harvesting we support, as well as any import from local file system or Web services, with some RDF-ization, we will simply implement a proper ETL utility that will do things right. Wazzup? can just point the user to that if the workload looks like loading. This will have its own status report giving a load and transform rate and will point out what takes the longest, after everything is duly parallelized and made asynchronous.
Beyond these lessons, there is more to say about the review and plenary, we will get to that a bit later. We did promise a new edition of the LOD cache in a couple of months, now on the clustered column-store platform. Look for advances in data discoverability.
LOD2 Plenary and Review: Semanticist, Think Database!
[
Orri Erling
]
Last week the LOD2 FP7 project had its first review, preceded by its third plenary meeting.
Before this, we did, as promised, get the column store and vectored execution capabilities of Virtuoso 7 Single-Server Edition extended to Virtuoso 7 Cluster Edition. More interesting still, we decoupled storage from the database server process, so now database files can migrate between server processes. This means that clusters are now elastic, i.e., new servers can be added to a cluster and the load can be redistributed without reloading the data.
These things were long planned, but now are done. Measurements will be published in some weeks, as part of CWI's continued running of RDF store benchmarks, per the LOD2 plan.
Doing the column store and elastic cluster is work enough, so I do not in general participate in support or consultancy or the like. This has some pros and cons. On the plus side, there is a relative lack of noise and a very clear idea of focus. Of course, this work is most highly applied, thus always informed by use cases, thus forgetting what ought to be done out there is not the problem. Rather, the problem is forgetting how things in fact are done as opposed to how they could or should be done.
To cut a long story short, it has become clear to me that the DBMS must tell the application developer what to do. Of course, the application developer could also look at performance metrics, but they do not, and explaining these metrics is too much work and yields no lasting benefit. Developers will produce all kinds of performance diagnostic traces if requested, but going through this song and dance can also be avoided by the right automation.
So, I will introduce two new product features called Wazzup? and Saywhat?
Wazzup? is answered by a mood line, like "Heavily disk bound: 100G more memory will give 10x speedup" or "Network bound: Processing in larger batches will give 5x more throughput" and Saywhat? is answered by some commentary on the user's last action, for example "there is no ?order with o_totalprice < 0" or "there is no property O_misspelledtotallprrice."
Wazzup? is about overall system state, and Saywhat? is about the user session, specifically query plans. But an explanation of a query plan is not understandable, so this will just point out some salient facts, like the reason why the answer comes out empty.
The other thing that came to my attention is the fact that a user has no instinctive feel for ETL. A database person takes it for a self-evident truth that data is loaded in bulk, but the application developer does not think of that. Likewise, the line between warehousing and federating is not instinctively felt; actually the question is not even posed in these terms. So one will find Web protocols and end-points and glue code on the app server when one ought to have ETL and adequate hardware for running the consolidated database.
Further, under-provisioning of equipment is endemic with semanticists. The Semantic Web gets a needlessly bad rap just because we find too much data on too little equipment. For example, I was surprised to learn that the Linked Geodata demo ran on only 16 GB RAM and 6 processor cores with 2 billion triples and 350 million points in a geo index. Now, even with our greatest space efficiency advances, there is no way this will run from memory.
It is not that the Web 2.0 stack is necessarily efficient (we hear the wildest stories of lack of database understanding from that side too), but at least there is a culture of running with enough equipment. Surely when the web-scale data gear (e.g. Google Bigtable, Yahoo PNUTS, Amazon Dynamo) was new, by the operators' own admission there was no way for this to be particularly efficient, database-wise. Not if your eventual consistency is a client application to a shared MySQL back-end. For a lookup or single-record-update workload, who cares when there is enough hardware? For analytics, there is the de facto impossibility of doing big joins, but map reduce is for that, all offline. The big web houses have always known how to deal with data; it is the smaller Web 2.0 guys who patch systems together with duct tape and memcache. Even so, the online experience gets created.
Semanticism has no part of this outlook, except maybe for Freebase, but then they are from California and now have been inside Google for a while.
We quite understand that when one needs to get big data online, one makes a key-value store as a point solution, because this way one owns what one operates, and the time to market is a lot shorter than if one tried building all this inside a general-purpose DBMS. Besides, the people who can in fact do this almost do not exist, and even if one had a whole army of this rare breed, development is not very scalable in a tightly-integrated system like a high-performance DBMS. Still further, to even start, one needs to own the DBMS, meaning that the initial platform must be known through and through. This is an issue even though open source platforms exist.
The graph data, semdata, schema-last, RDF, linked data enterprise -- whatever one calls it -- makes the bold proposition of bringing complex-query-at-scale to heterogeneous data. This is a database claim.
In the meantime, test deployments are made in defiance of database best practices. This is a bit like test driving a race car in reverse gear and steering by looking in the rear-view mirror.
There is also no short-term scalable way to educate people. At the LOD2 review, one comment was that an integrated project ought to clearly indicate how to set up the tool chain for good performance, specially as concerns interfaces between the tools. This is very true. Experience shows that developers of tools cannot accurately anticipate what usage patterns will emerge in the field. Therefore, we propose to do better than just documentation; we will make the server recognize the common sources of inefficiency and point the user to the right action.
Provisioning and usage patterns: The DBMS ought to know best.
Imagine the following conversation:
DBMS: Your application does single-triple INSERTs over client-server protocol all day, from a single client. 57% of real time goes in client server latency, 40% in cluster interconnect latency, 2% in compiling the statements, and 1% in doing the work. Use array parameters or bulk load from a file.
Operator: My developers use industry-standard Java class libraries with a service-oriented architecture and strictly enforced interfaces. This is called software engineering. Watch out ere you raise your voice against the canon.
[Some weeks later, after the load job has gone on for 10 days and gotten a third of the way, developers have discovered that JDBC has array parameters and are trying these.]
DBMS: 60% of real time goes into waiting for locks. 10% of transactions get aborted for deadlock. Transactions consist of an average of 10 client-server operations. Use stored procedures; acquire locks in predictable order; do SELECT FOR UPDATE. Throughput will be 4x higher if client-server operations are merged into a single operation. The transactions only INSERT; hence consider bulk load instead.
Operator: We are using an enterprise-class three-tier architecture. It has "enterprise" in the name and all the big guys are using it, so it must be scalable. Besides, it is distributed transactions, and distributed computing is the wave of the future. You are a cluster yourself, so the pot's got no business calling the kettle black.
[After a while, the data gets loaded with bulk load, but now on a single stream.]
DBMS: CPU is at 400% for an INSERT workload; adding more parallel threads will get 4.5x better throughput.
[Some time has elapsed and there are Ajax client apps out there trying to use the data.]
DBMS: Will you really not give me another 140 GB RAM and 16 more cores?
Operator: No, on general principles I will not, shut up.
DBMS: Do you know that your page impression takes 3 seconds and anything over 0.25 seconds is visibly slow? 300 GB worth of distinct pages have been accessed in the last 24 hours for 160 GB of RAM. Latency will drop 10x by using SSD; 50x by increasing RAM.
Operator: No dice, bucket. Shut up, besides, when I scroll through the data I always use for testing, I get it fast enough, you are just doing this out of greed and self-importance. You are a server among many, just like the mail server; you databases are just pretentious.
Currently addressing any of the above sorts of issues takes a long time and involves mostly-avoidable support communication. Questions of this sort do occur. We can probably produce commentary like the above based on logging some 50 numbers, and making some 15 regularly-run reports over these. The patterns to watch out for are well known. No, we will not make a Zippy the Pinhead office assistant; a computer should not try to be cute. This one will talk only in terms of gains from adjusting the deployment or usage patterns.
Now, suppose the operator said yes to the request for more cores and memory; then it would be up to the DBMS to deliver. This entails a capacity to redistribute itself automatically, and to give a quantitative report on the success of this measure. This means usage-based repartitioning of the data to equalize load over a cluster. The relevant metric in the above case is the drop in response time. On the other hand, the DBMS should also notice if there is clearly unused capacity.
This all will be presented as a line in the status report, so there is no extra wizard or workload analyzer that one must remember to run. For programmatic use there are SQL views for the relevant reports.
As for ETL, even if the DBMS can detect that it is not being done right, this does not mean that the user will know what to do. Therefore, for all the Web harvesting we support, as well as any import from local file system or Web services, with some RDF-ization, we will simply implement a proper ETL utility that will do things right. Wazzup? can just point the user to that if the workload looks like loading. This will have its own status report giving a load and transform rate and will point out what takes the longest, after everything is duly parallelized and made asynchronous.
Beyond these lessons, there is more to say about the review and plenary, we will get to that a bit later. We did promise a new edition of the LOD cache in a couple of months, now on the clustered column-store platform. Look for advances in data discoverability.
|
09/29/2011 10:50 GMT
|
Modified:
09/29/2011 14:48 GMT
|
GDB for the Data Driven Age (STI Summit Position Paper)
[
Virtuso Data Space Bot
]
Note: The following was written prior to the event, but was not posted until later due to human error.
The Semantic Technology Institute (STI) is organizing a meeting around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details.
GDB for the Data Driven Age
Databases and knowledge representation both have decades of history, but to date the exchange of ideas and techniques between these disciplines has been limited. The intuition that there would be value in greater cooperation has not failed to occur to researchers on either side; after all, both sides deal with data. From this, we have seen deductive databases emerge, as well as more recently "database friendly" profiles of OWL.
In this position paper we will examine what, in the most concrete terms, is needed in order to bring leading edge database technology together with expressive querying and reasoning. This draws on our experience in building Virtuoso, one of today's leading graph data stores. Following this, we argue for the creation of benchmarks and challenges that in fact do reflect reality and facilitate open and fair comparison of products and technologies.
Data integration is often mentioned as the motivating use case for GDB, commonly popularized today as RDF. Database research has over the past few years produced great advances for business intelligence (i.e., complex queries and read-mostly workloads). These advances are typified by compressed columnar storage and architecture-conscious execution models, mostly based on the idea of always processing multiple sets of values in each operation (vectoring). With these techniques, raw performance with relatively simple schemas and regular data (e.g., TPC-H) is no longer a barrier to extracting value from data.
A similar breakthrough has not been seen on the semantics side. Data integration still requires manual labor. Publishing GDB datasets is a good and necessary intermediate stage, but producing these datasets from diverse sources is not fundamentally different from doing the same work without GDB or RDF. Even so, GDB and RDF serve as a catalyst for a culture of publishing datasets.
GDB, as a base model for integration, offers the following benefits over a purely relational result format:
- All entities have globally unique identifiers.
- Any statements may be associated ad hoc to any entities.
- These statements can be scoped into graphs according to their provenance, time, validity, etc.
Obtaining this flexibility on a relational basis would simply require moving to an graph-like representation with essentially one-row-per-attribute. Indeed, we see key-value stores being used in online applications with high volatility of schema (e.g., social networks, search); and we also see relational applications making provisions for post-hoc addition of per-entity attributes (i.e., associating a bag of mixed non-first normal form data with entities). The benefits of a schema-last approach are recognized in many places.
GDB seems a priori a fit for all these requirements, thus how will it claim its place as a solution?
The first part of the answer lies in learning all the relevant database lessons. The second part lies in eliminating the impedance mismatch between querying and reasoning. The third and most important part consists of substantiating these claims in a manner that is understandable to the relevant publics, finally leading to the creation of a semantics-aware segment of the database industry. We will address each of these aspects in turn.
GDB and RDB
The problem is divided into storage format, execution, and query optimization. For the first two, Daniel Abadi's renowned Ph.D. thesis holds most of the keys. Space efficiency is specially important for Linked Data, since data is often voluminous, and many datasets have to be brought together for integration. Access patterns are also unpredictable, with indexed-random-access predominating, as opposed to RDB BI workloads where sequential scans and hash joins represent the bulk of the work. However, we find that a sorted column-wise compressed representation of Linked Data with a single quad table for all statements gives excellent space efficiency and good random access as well as random insert speed. The space efficiency is close to par with the equivalent column-wise relational format, since three of the four columns of the quad table compress to almost nothing. As many sort orders as are necessary may be maintained, but we find that two are enough, with some extra data structures for dealing with queries where the predicate is unspecified. The details are found in VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data. Since GDB/RDF is a model typed at run time, the engine must support an "ANY" data type for columns and query variables, where values on successive rows may be of different types. This is a straightforward enhancement.
Vectored execution is traditionally associated with column stores because the per-row access cost is relatively high, thus needing to access many nearby rows at a time in order to amortize the overhead. Aside this, vectored execution provides many opportunities for parallelism, from the instruction level all the way to threading and distributed execution on clusters, thus some form of execution on large numbers of concurrent query states is needed for RDF stores, just as it is needed for RDBMS"s.
Query optimization for GDBMS is similar to that for RDBMS, except that the statistics can no longer be collected by column and table, but must rather apply to individual entities and ranges of a single quad table. This can be provided through run-time sampling of the database based on constants in the query being optimized. This may take into account trivial inference such as expanding properties into the set of their sub-properties and the like. Beyond this, interleaving execution and optimization (as in ROX) seems to offer limitless possibilities, especially when inference is introduced, making optimizer statistics less predictive.
In summary, starting with an RDBMS and going to GDB entails changes to all parts of the engine, but these changes are not fundamental. One does need to own the engine; however, otherwise the expertise for efficiently implementing these changes will not exist. Essentially any DBMS technique may be translated to a GDB use case, if its application can be decided at run-time. GDB may be schema-less, yet most datasets have fairly regular structure; the question is simply to reconstruct the needed statistics and schema information from the data on an as you go basis. Techniques with high up-front cost, like constructing specially ordered materializations for optimizing specific queries, are harder to deploy but still conceivable for GDB also.
RDB and Inference
Compared to the straightforwardly performance oriented world of database engines, the contours of the landscape become less defined when moving to inference. Databases, whether relational or schema-less all perform roughly the same functions but inference is more diverse. We include here also techniques like machine learning and meta-reasoning for guiding reasoning, although these might not strictly fit the definition.
As we posit that data integration is the motivating use case for GDB as opposed to RDB (Relational Database Model), we must ask which modes of inference are actually required for data integration. Further, we need to ask whether these inferences ought to be applied as a preprocessing step (ETL or forward chaining), or as needed (backward chaining). Some low-hanging fruit can be collected by simply constructing class or property hierarchies; e.g., in the data at hand, the following properties have the meaning of company name, and the following classes have the meaning of company. We have found that such techniques can be efficiently supported at run-time, without materialization, if the support is simply built into the engine, which is in itself straightforward as long as one controls the engine. The same applies to trivial identity resolution, such as owl:sameAs or resolution of identity based on sharing an inverse-functional property value. These things take longer at run-time, but if one caches and reuses the result, one can get around materialization.
We do not believe in weak statements of identity, as in X is similar to Y, since the meaning of similarity is entirely contextual. X and Y may or may not be interchangeable depending on the application; thus the statement on identity needs to be strong, but it must be easy to modify the grounds on which such a statement is made. This is a further argument for why one should not automatically materialize consequences of identity, particularly if dealing with web data where identity is especially problematic.
Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling ("address as many columns in customer table" vs. "address normalized away under a contact entity"), normalization ("first name" and "last name" as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!
If data exists, the conversion questions are often answerable but their answer depends on context -- e.g., date of transaction for currency exchange rate; source of data for the definition of blood pressure.
Alongside these, there remain issues of identity, e.g., depending on the perspective, a national subsidiary is or is not the same entity as the parent company, companies with the same name can be entirely unrelated in different jurisdictions.
It appears that we may need a multi-level approach, combining different techniques for different phases of the integration process. We do not a priori believe that using SQL VIEWs for unit and modeling conversion, and then OWL for unifying terminology on top of this, were the whole solution. Even if this were the solution, the pipeline from the relational sources to SPARQL and OWL needs to be optimized for real-world BI information volumes, and the query language needs to be able to express the business questions and needs to interface with the reporting tools the analyst has come to expect.
Our answer so far consists of a SPARQL extension with non-recursive rules, roughly equivalent to SQL VIEWs in expressive power, tightly integrated to the query engine. There is also limited support for recursion through transitive subqueries; thus one can compactly express things like "all parts of all assemblies and subassemblies must satisfy applicable safety requirements, where the requirements depend on the type of the part in question."
This is only an intermediate step. We believe that a database-scale generic inference engine with at least Datalog power, with second-order extensions like computed predicates, is needed, executing inside the DBMS, benefiting from the whole array of optimizations database-science expects of execution engines, as part of the answer.
This will not relieve the analyst of having to consider that the currency rates in effect at the time of conversion must be taken into account when calculating profits, but this will at least make expressing this and similar pieces of context more compact.
We note that time-to-answer has historically won over raw performance. This was also the case for RDBMS when these were the fresh challenger to the CODASYL incumbents, just as was the case with the adoption of high-level languages. The key is that the raw performance must be sufficient for the real world task. With the adoption of the database lessons outlined in the previous section, we believe this to be the case for GDB (and thus, RDF).
Substantiating the Claims
Benchmarks have a stellar record for improving any metric they measure. The question is, how can we make a metric that measures GDB's ability to deliver on its claim to fame -- time-to-answer for big data -- with all the integration and other complexities this entails?
So far, GDB benchmarks have consisted of workloads where RDBMS are clearly better (e.g., LUBM, or the Berlin SPARQL Benchmark). This does not remove their usefulness for GDB, but does not constitute a GDB selling point, either.
We suggest a dual approach. The first part is demonstrating that GDB is scalable for BI: We take the industry standard decision support benchmark TPC-H, which is very favorable to RDB and quite unfavorable to GDB, and show that we can tackle the workload at reasonable cost. If TPC-H is all one wants, an RDBMS will stay a better fit, but then this benchmark does not capture any of the heterogeneity, schema evolution, or other such requirements faced by real-world data warehouses. This is still a qualification test, not the selling point.
The issue of benchmark is inextricably tied to the issue of messaging. There must be a compelling story, with which the IT community can identify. Further, the benchmark must capture real-world challenges in the area of interest. With all this, the benchmark should not be too expensive to run. Here too, a multistage approach suggests itself.
Our tentative answer to this question is the Social Intelligence Benchmark (SIB), developed together with CWI and other partners in the LOD2 consortium. This simulates a social network and combines an online workload with complex analytics. This benchmark should cover all of the target areas of the LOD2 project, so that the project itself generates its own metric of success. The project has clear data integration targets, especially as applies to Web and Linked Data. Questions of integration with enterprise sources need to be further developed; for example, comparing CRM data with extractions from the online conversation space for market research.
Data integration will invariably involve human effort, and the area cannot be satisfactorily covered with metrics of scale and throughput alone. Development time, accuracy of results, and cost of maintenance are all factors. Furthermore, the task being modeled must correspond to reality, still without being too domain-specific or prohibitively time-consuming to implement.
Conclusions
The data driven world will increase rewards for efficiency in data integration. We believe that such efficiency crucially depends on semantics. Real world requirements just might throw the database and AI communities together with enough heat and pressure for fusion to ignite, allegorically speaking. Without a clear and present need, the geek world analog of electrostatic repulsion will keep the communities separate, as has been the case thus far, and no new, qualitatively-different element will arise.
Efforts such as this STI Summit and the LOD2 Project are needed for setting directions and communicating the requirement to the research world. In our fusion analogy, this is the field which directs the nuclei to collide.
Once there is an actual reaction that produces more than it consumes by a sufficient margin, regular business dynamics will take over, and we will have an industry with several products of comparable capability, as well as a set of metrics, all to the benefit of the end user.
References
-
TPC-H results pages
-
Daniel Abadi's Ph.D. Thesis, Query Execution in
Column-Oriented Database Systems ( PDF )
-
Our VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data ( HTML | PDF )
-
CWI's ROX: Run-time Optimization of XQueries ( PDF )
-
The LOD2 Project web site
|
07/26/2011 09:37 GMT
|
Modified:
07/26/2011 09:40 GMT
|
GDB for the Data Driven Age (STI Summit Position Paper)
[
Orri Erling
]
Note: The following was written prior to the event, but was not posted until later due to human error.
The Semantic Technology Institute (STI) is organizing a meeting around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details.
GDB for the Data Driven Age
Databases and knowledge representation both have decades of history, but to date the exchange of ideas and techniques between these disciplines has been limited. The intuition that there would be value in greater cooperation has not failed to occur to researchers on either side; after all, both sides deal with data. From this, we have seen deductive databases emerge, as well as more recently "database friendly" profiles of OWL.
In this position paper we will examine what, in the most concrete terms, is needed in order to bring leading edge database technology together with expressive querying and reasoning. This draws on our experience in building Virtuoso, one of today's leading graph data stores. Following this, we argue for the creation of benchmarks and challenges that in fact do reflect reality and facilitate open and fair comparison of products and technologies.
Data integration is often mentioned as the motivating use case for GDB, commonly popularized today as RDF. Database research has over the past few years produced great advances for business intelligence (i.e., complex queries and read-mostly workloads). These advances are typified by compressed columnar storage and architecture-conscious execution models, mostly based on the idea of always processing multiple sets of values in each operation (vectoring). With these techniques, raw performance with relatively simple schemas and regular data (e.g., TPC-H) is no longer a barrier to extracting value from data.
A similar breakthrough has not been seen on the semantics side. Data integration still requires manual labor. Publishing GDB datasets is a good and necessary intermediate stage, but producing these datasets from diverse sources is not fundamentally different from doing the same work without GDB or RDF. Even so, GDB and RDF serve as a catalyst for a culture of publishing datasets.
GDB, as a base model for integration, offers the following benefits over a purely relational result format:
- All entities have globally unique identifiers.
- Any statements may be associated ad hoc to any entities.
- These statements can be scoped into graphs according to their provenance, time, validity, etc.
Obtaining this flexibility on a relational basis would simply require moving to an graph-like representation with essentially one-row-per-attribute. Indeed, we see key-value stores being used in online applications with high volatility of schema (e.g., social networks, search); and we also see relational applications making provisions for post-hoc addition of per-entity attributes (i.e., associating a bag of mixed non-first normal form data with entities). The benefits of a schema-last approach are recognized in many places.
GDB seems a priori a fit for all these requirements, thus how will it claim its place as a solution?
The first part of the answer lies in learning all the relevant database lessons. The second part lies in eliminating the impedance mismatch between querying and reasoning. The third and most important part consists of substantiating these claims in a manner that is understandable to the relevant publics, finally leading to the creation of a semantics-aware segment of the database industry. We will address each of these aspects in turn.
GDB and RDB
The problem is divided into storage format, execution, and query optimization. For the first two, Daniel Abadi's renowned Ph.D. thesis holds most of the keys. Space efficiency is specially important for Linked Data, since data is often voluminous, and many datasets have to be brought together for integration. Access patterns are also unpredictable, with indexed-random-access predominating, as opposed to RDB BI workloads where sequential scans and hash joins represent the bulk of the work. However, we find that a sorted column-wise compressed representation of Linked Data with a single quad table for all statements gives excellent space efficiency and good random access as well as random insert speed. The space efficiency is close to par with the equivalent column-wise relational format, since three of the four columns of the quad table compress to almost nothing. As many sort orders as are necessary may be maintained, but we find that two are enough, with some extra data structures for dealing with queries where the predicate is unspecified. The details are found in VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data. Since GDB/RDF is a model typed at run time, the engine must support an "ANY" data type for columns and query variables, where values on successive rows may be of different types. This is a straightforward enhancement.
Vectored execution is traditionally associated with column stores because the per-row access cost is relatively high, thus needing to access many nearby rows at a time in order to amortize the overhead. Aside this, vectored execution provides many opportunities for parallelism, from the instruction level all the way to threading and distributed execution on clusters, thus some form of execution on large numbers of concurrent query states is needed for RDF stores, just as it is needed for RDBMS"s.
Query optimization for GDBMS is similar to that for RDBMS, except that the statistics can no longer be collected by column and table, but must rather apply to individual entities and ranges of a single quad table. This can be provided through run-time sampling of the database based on constants in the query being optimized. This may take into account trivial inference such as expanding properties into the set of their sub-properties and the like. Beyond this, interleaving execution and optimization (as in ROX) seems to offer limitless possibilities, especially when inference is introduced, making optimizer statistics less predictive.
In summary, starting with an RDBMS and going to GDB entails changes to all parts of the engine, but these changes are not fundamental. One does need to own the engine; however, otherwise the expertise for efficiently implementing these changes will not exist. Essentially any DBMS technique may be translated to a GDB use case, if its application can be decided at run-time. GDB may be schema-less, yet most datasets have fairly regular structure; the question is simply to reconstruct the needed statistics and schema information from the data on an as you go basis. Techniques with high up-front cost, like constructing specially ordered materializations for optimizing specific queries, are harder to deploy but still conceivable for GDB also.
RDB and Inference
Compared to the straightforwardly performance oriented world of database engines, the contours of the landscape become less defined when moving to inference. Databases, whether relational or schema-less all perform roughly the same functions but inference is more diverse. We include here also techniques like machine learning and meta-reasoning for guiding reasoning, although these might not strictly fit the definition.
As we posit that data integration is the motivating use case for GDB as opposed to RDB (Relational Database Model), we must ask which modes of inference are actually required for data integration. Further, we need to ask whether these inferences ought to be applied as a preprocessing step (ETL or forward chaining), or as needed (backward chaining). Some low-hanging fruit can be collected by simply constructing class or property hierarchies; e.g., in the data at hand, the following properties have the meaning of company name, and the following classes have the meaning of company. We have found that such techniques can be efficiently supported at run-time, without materialization, if the support is simply built into the engine, which is in itself straightforward as long as one controls the engine. The same applies to trivial identity resolution, such as owl:sameAs or resolution of identity based on sharing an inverse-functional property value. These things take longer at run-time, but if one caches and reuses the result, one can get around materialization.
We do not believe in weak statements of identity, as in X is similar to Y, since the meaning of similarity is entirely contextual. X and Y may or may not be interchangeable depending on the application; thus the statement on identity needs to be strong, but it must be easy to modify the grounds on which such a statement is made. This is a further argument for why one should not automatically materialize consequences of identity, particularly if dealing with web data where identity is especially problematic.
Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling ("address as many columns in customer table" vs. "address normalized away under a contact entity"), normalization ("first name" and "last name" as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!
If data exists, the conversion questions are often answerable but their answer depends on context -- e.g., date of transaction for currency exchange rate; source of data for the definition of blood pressure.
Alongside these, there remain issues of identity, e.g., depending on the perspective, a national subsidiary is or is not the same entity as the parent company, companies with the same name can be entirely unrelated in different jurisdictions.
It appears that we may need a multi-level approach, combining different techniques for different phases of the integration process. We do not a priori believe that using SQL VIEWs for unit and modeling conversion, and then OWL for unifying terminology on top of this, were the whole solution. Even if this were the solution, the pipeline from the relational sources to SPARQL and OWL needs to be optimized for real-world BI information volumes, and the query language needs to be able to express the business questions and needs to interface with the reporting tools the analyst has come to expect.
Our answer so far consists of a SPARQL extension with non-recursive rules, roughly equivalent to SQL VIEWs in expressive power, tightly integrated to the query engine. There is also limited support for recursion through transitive subqueries; thus one can compactly express things like "all parts of all assemblies and subassemblies must satisfy applicable safety requirements, where the requirements depend on the type of the part in question."
This is only an intermediate step. We believe that a database-scale generic inference engine with at least Datalog power, with second-order extensions like computed predicates, is needed, executing inside the DBMS, benefiting from the whole array of optimizations database-science expects of execution engines, as part of the answer.
This will not relieve the analyst of having to consider that the currency rates in effect at the time of conversion must be taken into account when calculating profits, but this will at least make expressing this and similar pieces of context more compact.
We note that time-to-answer has historically won over raw performance. This was also the case for RDBMS when these were the fresh challenger to the CODASYL incumbents, just as was the case with the adoption of high-level languages. The key is that the raw performance must be sufficient for the real world task. With the adoption of the database lessons outlined in the previous section, we believe this to be the case for GDB (and thus, RDF).
Substantiating the Claims
Benchmarks have a stellar record for improving any metric they measure. The question is, how can we make a metric that measures GDB's ability to deliver on its claim to fame -- time-to-answer for big data -- with all the integration and other complexities this entails?
So far, GDB benchmarks have consisted of workloads where RDBMS are clearly better (e.g., LUBM, or the Berlin SPARQL Benchmark). This does not remove their usefulness for GDB, but does not constitute a GDB selling point, either.
We suggest a dual approach. The first part is demonstrating that GDB is scalable for BI: We take the industry standard decision support benchmark TPC-H, which is very favorable to RDB and quite unfavorable to GDB, and show that we can tackle the workload at reasonable cost. If TPC-H is all one wants, an RDBMS will stay a better fit, but then this benchmark does not capture any of the heterogeneity, schema evolution, or other such requirements faced by real-world data warehouses. This is still a qualification test, not the selling point.
The issue of benchmark is inextricably tied to the issue of messaging. There must be a compelling story, with which the IT community can identify. Further, the benchmark must capture real-world challenges in the area of interest. With all this, the benchmark should not be too expensive to run. Here too, a multistage approach suggests itself.
Our tentative answer to this question is the Social Intelligence Benchmark (SIB), developed together with CWI and other partners in the LOD2 consortium. This simulates a social network and combines an online workload with complex analytics. This benchmark should cover all of the target areas of the LOD2 project, so that the project itself generates its own metric of success. The project has clear data integration targets, especially as applies to Web and Linked Data. Questions of integration with enterprise sources need to be further developed; for example, comparing CRM data with extractions from the online conversation space for market research.
Data integration will invariably involve human effort, and the area cannot be satisfactorily covered with metrics of scale and throughput alone. Development time, accuracy of results, and cost of maintenance are all factors. Furthermore, the task being modeled must correspond to reality, still without being too domain-specific or prohibitively time-consuming to implement.
Conclusions
The data driven world will increase rewards for efficiency in data integration. We believe that such efficiency crucially depends on semantics. Real world requirements just might throw the database and AI communities together with enough heat and pressure for fusion to ignite, allegorically speaking. Without a clear and present need, the geek world analog of electrostatic repulsion will keep the communities separate, as has been the case thus far, and no new, qualitatively-different element will arise.
Efforts such as this STI Summit and the LOD2 Project are needed for setting directions and communicating the requirement to the research world. In our fusion analogy, this is the field which directs the nuclei to collide.
Once there is an actual reaction that produces more than it consumes by a sufficient margin, regular business dynamics will take over, and we will have an industry with several products of comparable capability, as well as a set of metrics, all to the benefit of the end user.
References
-
TPC-H results pages
-
Daniel Abadi's Ph.D. Thesis, Query Execution in
Column-Oriented Database Systems ( PDF )
-
Our VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data ( HTML | PDF )
-
CWI's ROX: Run-time Optimization of XQueries ( PDF )
-
The LOD2 Project web site
|
07/26/2011 09:36 GMT
|
Modified:
07/26/2011 09:39 GMT
|
The 2011 STI Semantic Summit
[
Virtuso Data Space Bot
]
I was recently at the STI 2011 summit in Riga, Latvia.
This is a meeting of senior participants in the semantic web and sem tech scene, organized by STI of Dieter Fensel fame, with board members like Michael Brodie, Mark Greaves, and Jim Hendler.
This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had Peter Boncz of CWI, of MonetDB and VectorWise fame, attending the proceedings.
Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background.
At present, CWI and OpenLink are working together in the LOD2 EU FP7 project, around the general topic of bringing the best of Relational Database (RDB) science to the Graph Database (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink's column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is TPC-H, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts.
Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables but we have established a beach head. From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL.
So let's get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaining about how bad and un-scalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from Bio2RDF pointed out that there still is a cost to publishing query services, specially for complex queries. Well, this cost will be substantially reduced.
The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB.
The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it.
This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that that (relational) database guys are only about performance with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then comments back that it can well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an "AI-complete" problem with infinite variety and consequent difficulty of measurement. So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics.
Quite so. We are somewhere in the middle. I'd say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury.
Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example, owl:sameAs will not work when entities simply do not align: One database models a product as a parts hierarchy; another does the same but now based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions.
It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions:
-
Database technology: Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale.
-
Benchmarks: These advances will be demonstrable through benchmarking. There is a better suite of benchmarks with many variations of BSBM, an GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIBB) with actual graph data. There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC.
After these basics are more or less in hand, we have a vista of more diverse questions:
-
What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine.
-
Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS?
-
Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? Datalog is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? Berkeley Orders Of Magnitude claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference.
In all these questions, we of necessity turn to the user community. In fact we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges that will hopefully ameliorate this state of matters, to be released soon.
The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about "pay as you go" integration. On the one hand, there is no up-front integration of information systems just for its own sake, so pay as you go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process which has its distinct steps and maintenance and within itself it is planned, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. The open government data should offer a playground for this and there will be a special challenge around this.
Schema.org and Microdata got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format.
All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.
The 2011 STI Semantic Summit
[
Orri Erling
]
I was recently at the STI 2011 summit in Riga, Latvia.
This is a meeting of senior participants in the semantic web and sem tech scene, organized by STI of Dieter Fensel fame, with board members like Michael Brodie, Mark Greaves, and Jim Hendler.
This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had Peter Boncz of CWI, of MonetDB and VectorWise fame, attending the proceedings.
Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background.
At present, CWI and OpenLink are working together in the LOD2 EU FP7 project, around the general topic of bringing the best of Relational Database (RDB) science to the Graph Database (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink's column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is TPC-H, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts.
Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables but we have established a beach head. From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL.
So let's get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaining about how bad and un-scalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from Bio2RDF pointed out that there still is a cost to publishing query services, specially for complex queries. Well, this cost will be substantially reduced.
The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB.
The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it.
This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that that (relational) database guys are only about performance with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then comments back that it can well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an "AI-complete" problem with infinite variety and consequent difficulty of measurement. So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics.
Quite so. We are somewhere in the middle. I'd say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury.
Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example, owl:sameAs will not work when entities simply do not align: One database models a product as a parts hierarchy; another does the same but now based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions.
It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions:
-
Database technology: Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale.
-
Benchmarks: These advances will be demonstrable through benchmarking. There is a better suite of benchmarks with many variations of BSBM, an GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIBB) with actual graph data. There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC.
After these basics are more or less in hand, we have a vista of more diverse questions:
-
What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine.
-
Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS?
-
Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? Datalog is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? Berkeley Orders Of Magnitude claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference.
In all these questions, we of necessity turn to the user community. In fact we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges that will hopefully ameliorate this state of matters, to be released soon.
The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about "pay as you go" integration. On the one hand, there is no up-front integration of information systems just for its own sake, so pay as you go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process which has its distinct steps and maintenance and within itself it is planned, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. The open government data should offer a playground for this and there will be a special challenge around this.
Schema.org and Microdata got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format.
All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.
Transaction Semantics in RDF and Relational Models
[
Orri Erling
]
As a part of defining benchmark audit for testing ACID properties on RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and implications as concern locking (pessimistic) and multi-version (optimistic) concurrency control schemes.
In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.
Insert (Create) and Delete
INSERT and DELETE as defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the
DELETE FROM table WHERE key IN
( SELECT key1 FROM other_table )
expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.
The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.
Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.
We would not get the desired end-state of either all the inserted triples or no triples with subject 1 if the read or the DELETE were not serializable.
Furthermore if a DELETE template produced a triple that did not exist in the pre-image, the DELETE semantics still imply that this also does not exist in the after-image, which implies serializability.
Read and Update
Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is
a balance 10
b balance 10
We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.
A relational database needs REPEATABLE READ isolation for this.
With RDF, txn1 reads that a has a balance of 10. At the same time, txn1 reads the balance of a. txn2 waits because the read of txn1 is exclusive. txn1 proceeds and read the balance of b. It then updates the balance of a and b.
All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.
We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in a RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.
Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.
locking order and OLTP
To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but still may get fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantic one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
Pessimistic Vs. Optimistic Concurrency Control
We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.
Eventual Consistency
Web scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed (i.e., DELETEd + INSERTed quads, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.
As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
Conclusions
We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.
Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
|
03/22/2011 19:55 GMT
|
Modified:
03/22/2011 18:24 GMT
|
Transaction Semantics in RDF and Relational Models
[
Virtuso Data Space Bot
]
As a part of defining benchmark audit for testing ACID properties on RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and implications as concern locking (pessimistic) and multi-version (optimistic) concurrency control schemes.
In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.
Insert (Create) and Delete
INSERT and DELETE as defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the
DELETE FROM table WHERE key IN
( SELECT key1 FROM other_table )
expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.
The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.
Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.
We would not get the desired end-state of either all the inserted triples or no triples with subject 1 if the read or the DELETE were not serializable.
Furthermore if a DELETE template produced a triple that did not exist in the pre-image, the DELETE semantics still imply that this also does not exist in the after-image, which implies serializability.
Read and Update
Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is
a balance 10
b balance 10
We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.
A relational database needs REPEATABLE READ isolation for this.
With RDF, txn1 reads that a has a balance of 10. At the same time, txn1 reads the balance of a. txn2 waits because the read of txn1 is exclusive. txn1 proceeds and read the balance of b. It then updates the balance of a and b.
All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.
We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in a RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.
Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.
locking order and OLTP
To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but still may get fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantic one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
Pessimistic Vs. Optimistic Concurrency Control
We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.
Eventual Consistency
Web scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed (i.e., DELETEd + INSERTed quads, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.
As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
Conclusions
We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.
Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
|
03/22/2011 19:55 GMT
|
Modified:
03/22/2011 18:24 GMT
|
RDF and Transactions
[
Orri Erling
]
I will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists.
Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?
There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.
At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.
If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.
There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).
If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.
There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.
Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.
Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.
As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.
As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.
This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.
As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.
With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.
If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, if these are under a certain size (normally 10000 quads) are guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there always was a possibility of having an INSERT or DELETE fail because of deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the INSERT or DELETE of a small graph.
Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.
In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.
|
03/22/2011 18:52 GMT
|
Modified:
03/22/2011 17:44 GMT
|
RDF and Transactions
[
Virtuso Data Space Bot
]
I will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists.
Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?
There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.
At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.
If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.
There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).
If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.
There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.
Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.
Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.
As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.
As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.
This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.
As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.
With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.
If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, if these are under a certain size (normally 10000 quads) are guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there always was a possibility of having an INSERT or DELETE fail because of deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the INSERT or DELETE of a small graph.
Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.
In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.
|
03/22/2011 18:52 GMT
|
Modified:
03/22/2011 17:44 GMT
|
|
|