Transaction Semantics in RDF and Relational Models
As part of defining a benchmark audit for testing ACID properties in RDF stores, we here examine RDF scenarios where a lack of concurrency control causes inconsistent results. In doing so, we consider common implementation techniques and their implications for locking (pessimistic) and multi-version (optimistic) concurrency control schemes.
In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.
Insert (Create) and Delete
INSERT and DELETE as defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the
DELETE FROM table WHERE key IN
( SELECT key1 FROM other_table )
expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.
The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.
Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
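For concreteness, the two operations could be written in SPARQL roughly as follows. This is only a sketch: the numeric identifiers above stand in for IRIs and literals, rendered here with a made-up ex: prefix, and each statement would be submitted as its own transaction.
PREFIX ex: <http://example.com/>
# the DELETE { 1 ?x ?y }: remove every triple with subject 1
DELETE WHERE { ex:s1 ?x ?y } ;
# the concurrent INSERT { 1 2 4 . 1 2 3 . 1 3 5 }
INSERT DATA { ex:s1 ex:p2 4 . ex:s1 ex:p2 3 . ex:s1 ex:p3 5 } ;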
Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.
We would not get the desired end state (either all the inserted triples present, or no triples with subject 1) if the read performed by the DELETE were not serializable.
Furthermore, if a DELETE template produces a triple that did not exist in the pre-image, the DELETE semantics still imply that this triple does not exist in the after-image, which again implies serializability.
Read and Update
Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is
a balance 10
b balance 10
We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.
A relational database needs REPEATABLE READ isolation for this.
With RDF, txn1 reads that a has a balance of 10. At the same time, txn2 tries to read the balance of a, and waits, because the read of txn1 is exclusive. txn1 proceeds and reads the balance of b. It then updates the balances of a and b.
All this goes through without the deadlock that is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.
We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in an RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.
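In SPARQL terms, txn1's update of a's balance is therefore something like the following sketch (acct: is a made-up prefix; the point is that the "update" is literally a DELETE plus an INSERT):
PREFIX acct: <http://example.com/accounts/>
DELETE { acct:a acct:balance 10 }
INSERT { acct:a acct:balance 9 }
WHERE  { acct:a acct:balance 10 } ;
The WHERE clause is the read whose isolation is at issue in the next paragraph.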
Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.
Locking Order and OLTP
To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood of contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key value first. In this way, the workload does not deadlock. In actual fact, with clusters and parallel execution, lock acquisition is not guaranteed to be serial, so deadlocks do not entirely go away, but they do become fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why an OLTP application must in any case be built to deal with the possibility of deadlock.
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMSs usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system, and applying row-level locking to this does not give the semantics one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
Pessimistic vs. Optimistic Concurrency Control
We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.
Eventual Consistency
Web scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed quads (i.e., DELETEd plus INSERTed, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.
As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
Conclusions
We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.
Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
Posted: 03/22/2011 19:55 GMT-0500 | Modified: 03/22/2011 18:24 GMT-0500
Benchmarks, Redux (part 1): On RDF Benchmarks
This post introduces a series on RDF benchmarking. In these posts I will cover the following:
- Correct misleading information about us in the recent Berlin report: the reported load rate is off-the-wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.
- Discuss configuration options for Virtuoso.
- Tell a story about multithreading and its perils, and how vectoring and scale-out can save us.
- Analyze the run-time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.
- Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.
- Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Should things be done à la TPC or à la TREC? We will hopefully try a bit of both; at least, so I have proposed to our partners in LOD2, the EU FP7 project that also funded the recent Berlin report.
- Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload: the Social Intelligence Benchmark.
- Talk about BSBM specifically. What does it measure?
- Discuss some experiments with the BI use case of BSBM.
- Document how the results mentioned here were obtained, and suggest practices for benchmark running and disclosure.
The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMS.
Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of benchmark material when, at least in some cases, licenses prohibit unauthorized publishing of benchmark results might be seen to conflict with the spirit of the license, if not its letter. We will see.
For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor's permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.
In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project's own products but should also include the general benchmarking aspect. I will say with confidence that running single-server benchmarks at a maximum of 200 Mtriples of data is not stretching anything.
So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go for example to Giovanni's cluster at DERI and do 10 and 20 billion triples, this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one.
So I asked everybody about config instructions, which produced a certain amount of dismay, as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative, though, since Ontotext and Garlik provided some information. We will look into these over this week and next. We will not publish any information without asking first.
In this series of posts I will only talk about OpenLink Software.
Benchmarks, Redux Series
- Benchmarks, Redux (part 1): On RDF Benchmarks (this post)
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
Posted: 02/28/2011 15:20 GMT-0500 | Modified: 03/14/2011 17:16 GMT-0500
VLDB Semdata Workshop - The New Frontier of Semdata
This is a revised version of the talk I will be giving at the Semdata workshop at VLDB 2010.
The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso.
A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.
The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMSs, point solutions done with map-reduce or the like, and lastly, in some cases, key-value stores with relaxed schema but limited querying.
The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability — you can run an RDF database on a cluster — but a question of relative cost as opposed to alternatives.
The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.
I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.
But to realize the ultimate promise of RDF/Linked Data/Semdata, however we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores.
Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest.
The key-value store has some of the same appeal, as it is the DBMS laid bare, so to say, made understandable, without the again intractably-complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard-code a point solution for the business question if one really must, maybe even in map-reduce.
Besides, for some things that go beyond SQL (for example, with graph structures), there really isn't a good solution.
Now, enter Vertica, Greenplum, VectorWise (a MonetDB project derivative from Ingres) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible.
Here we find the next frontier of Semdata. Take Joe Hellerstein et al.'s work on declarative logic for the data-centric data center.
We have heard it many times: when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative query, well-defined semantics, and parallel-database capability of the leading RDF stores. Merge this with the locality, compression, and throughput of the best analytics DBMSs.
Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.
Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation.
Such a hybrid database-fusion frontier is relevant because it addresses heterogeneous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process through faster integration, including integration with linked open data, to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.
In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.
Last week I was at the LOD2 kick off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.
Posted: 09/13/2010 18:09 GMT-0500 | Modified: 09/21/2010 10:52 GMT-0500
Social Web Camp (#5 of 5)
(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.
By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks — one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload.
Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus, and this cost will have to be managed somehow, for example by automation or outsourcing. But if social media is only automated fronts tweeting and re-tweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.
There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don't get lost in it.
There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.
I participated in discussions on security and privacy and on mobile social networks and context.
For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.
There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlik's DataPatrol should be a part of the social web infrastructure of the future.
People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.
In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.
For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one's location at the granularity of the city; for some other purposes, one would say which conference room one is in.
Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.
There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.
Posted: 04/30/2009 12:14 GMT-0500 | Modified: 04/30/2009 12:51 GMT-0500
Beyond Applications - Introducing the Planetary Datasphere (Part 2)
We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, for end-user experience, application development, and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.
Application Hosting
For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.
Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.
What of Proprietary Data and its Security?
Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one's own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes.
What of Individual Privacy on the Open Web?
The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user's department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.
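As a sketch of that rewriting (table, column, and function names here are hypothetical, not any particular product's API), a policy on a payroll table could turn the query the application submits into the one the DBMS actually evaluates:
-- what the application submits
SELECT emp_name, salary FROM payroll;
-- what the DBMS evaluates after appending the policy predicate
SELECT emp_name, salary
  FROM payroll
 WHERE department_id = ( SELECT department_id
                           FROM employee
                          WHERE user_name = current_user() );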
I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer's capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.
Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.
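To illustrate, with a purely hypothetical graph IRI and vocabulary: if an application keeps one graph per photo album, the policy layer only has to decide whether the requesting user may read that graph, and the query itself stays simple:
PREFIX ex: <http://example.com/schema#>
SELECT ?photo ?caption
FROM <http://example.com/albums/alice/summer-2008>   # one graph per album
WHERE { ?photo a ex:Photo ; ex:caption ?caption }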
It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner's social network. Defining such policies is typically a task for the application developer.
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, the Linked Data Web must enforce it whenever a query references something, even if this is an intermediate result not directly shown to the user.
The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one's disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.
Federation vs. Centralization
On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.
A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.
The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data.
Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world's spectrum of query needs.
What is the Cost of Schema-Last?
The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.
We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.
While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.
Historically, our industry has been driven by two phenomena:
- New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.
- Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.
To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.
Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It's in the air.
DataSphere Precursors
Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.
We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.
We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.
Databases and Servers
Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.
This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards.
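To make the arithmetic explicit with the rough figures above (indicative only):
1 TB of RAM at 32G per billion triples gives roughly 32 billion triples resident in memory.
1 TB of RAM from 8G boards means about 128 boards; at roughly $700 per board, that is around $90K in components, consistent with the under-$100K figure.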
The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD's get a bit cheaper, one could have SSD for the database and disk for backup.
With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.
For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self administrating since these things are quite complex and easily intractable if one does not have in depth understanding of this rather complex field.
The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.
Conclusions and Next Steps
The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.
Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.
The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database matters, concerning automatic provisioning and the like.
It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software.
Posted: 03/25/2009 10:50 GMT-0500 | Modified: 03/25/2009 12:31 GMT-0500
Virtuoso RDF: A Getting Started Guide for the Developer
It is a long-standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated.
The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.
You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (M-x shell) works best, but the open source isql utility should have command-line editing as well. The emacs shell is, however, convenient for cutting and pasting things between shell and files.
To get started, cd into binsrc/tests/lubm.
To verify that this works, you can do
./test_server.sh virtuoso-t
This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.
Loading the Data
The file lubm-load.sql contains the commands for loading the LUBM single university qualification database.
The data files themselves are in lubm_8000, 15 files in RDFXML.
There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark.
So now let's go through this procedure.
Start the server:
$ virtuoso-t -f &
This starts the server in foreground mode, and puts it in the background of the shell.
Now we connect to it with the isql utility.
$ isql 1111 dba dba
This gives a SQL> prompt. The default username and password are both dba.
When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword.
The lubm-load.sql file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory.
Next it calls this function for the lubm_8000 directory under the server's working directory.
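The procedure is roughly of the following shape. This is a sketch rather than the exact text of lubm-load.sql: the base URI is made up, and the directory must be one the server is allowed to read (see the DirsAllowed setting in virtuoso.ini).
CREATE PROCEDURE load_lubm (IN dir VARCHAR)
{
  DECLARE files ANY;
  DECLARE inx INT;
  files := sys_dirlist (dir, 1);          -- list the plain files in the directory
  inx := 0;
  WHILE (inx < length (files))
    {
      -- parse one RDF/XML file and add its triples to the <lubm> graph
      DB.DBA.RDF_LOAD_RDFXML (file_to_string_output (dir || aref (files, inx)),
                              'http://lubm.example.org/',   -- base URI for relative IRIs (made up)
                              'lubm');
      inx := inx + 1;
    }
}
;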
sparql
CLEAR GRAPH <lubm>;
sparql
CLEAR GRAPH <inf>;
load_lubm ( server_root() || '/lubm_8000/' );
Then it verifies that the right number of triples is found in the <lubm> graph.
sparql
SELECT COUNT(*)
FROM <lubm>
WHERE { ?x ?y ?z } ;
The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.
Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database.
sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
INSERT
INTO GRAPH <lubm>
{ ?x ub:subOrganizationOf ?z }
FROM <lubm>
WHERE { ?x ub:subOrganizationOf ?y .
?y ub:subOrganizationOf ?z .
};
Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph.
DB.DBA.TTLP ( file_to_string ( 'inf.nt' ),
'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl',
'inf'
) ;
sparql
SELECT COUNT(*)
FROM <inf>
WHERE { ?x ?y ?z } ;
Then we declare that the triples in the <inf> graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the 'inft' rule set. Otherwise this has no effect.
rdfs_rule_set ('inft', 'inf');
This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.
checkpoint;
Now we are ready for querying.
Querying the Data
The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the <inf> graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the <lubm> graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file.
If you wish to run all the commands in a SQL file, you can type load <filename>; (e.g., load lubm-cp.sql;) at the SQL> prompt. If you wish to try individual statements, you can paste them to the command line.
For example:
SQL> sparql
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT *
FROM <lubm>
WHERE { ?x a ub:Publication .
?x ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>
};
VARCHAR
_______________________________________________________________________
http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5
6 Rows. -- 4 msec.
To stop the server, simply type shutdown; at the SQL> prompt.
If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like —
[HTTPServer]
ServerPort = 8421
ServerRoot = .
ServerThreads = 2
— to the end of the virtuoso.ini file in the lubm directory. Then shutdown and restart (type shutdown; at the SQL> prompt and then virtuoso-t -f & at the shell prompt).
Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.
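For example, from the shell, a query can be sent to the endpoint with curl (any HTTP client that can POST a query form field will do):
$ curl --data-urlencode 'query=SELECT COUNT(*) FROM <lubm> WHERE { ?x ?y ?z }' \
       http://localhost:8421/sparql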
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface.
To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF.
To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.
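In the LUBM example, this amounts to something like the following at the SQL> prompt (a sketch, reusing the commands shown above):
log_enable (2);                                 -- row-autocommit; no transaction log, no rollback space needed
load_lubm ( server_root() || '/lubm_8000/' );   -- the bulk load itself
checkpoint;                                     -- make the loaded state durable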
Posted: 12/17/2008 12:31 GMT-0500 | Modified: 12/17/2008 12:41 GMT-0500
"E Pluribus Unum", or "Inversely Functional Identity", or "Smooshing Without the Stickiness" (re-updated)
What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable.
This is complex, so I will begin with the point; the interested reader may read on for the details and implications. Starting with the soon-to-be-released version 6, Virtuoso allows you to say that two things are the same if they share a uniquely identifying property. Examples of uniquely identifying properties would be a book's ISBN, or a person's social security number plus full name. In relational language this is a unique key; in RDF parlance, an inverse functional property.
In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut.
In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set.
On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems.
To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as "same enough," but a query about criminal prosecution would obviously need to be more precise about sameness.
Our ontology is defined like this:
-- Populate a named graph with the triples you want to use in query time inferencing
ttlp ( '
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
foaf:name a owl:InverseFunctionalProperty .
',
'xx',
'b3sifp'
);
-- Declare that the graph contains an ontology for use in query time inferencing
rdfs_rule_set ( 'http://example.com/rules/b3sifp#',
'b3sifp'
);
Then use it:
sparql
DEFINE input:inference "http://example.com/rules/b3sifp#"
SELECT DISTINCT ?k ?f1 ?f2
WHERE { ?k foaf:name ?n .
?n bif:contains "'Kjetil Kjernsmo'" .
?k foaf:knows ?f1 .
?f1 foaf:knows ?f2
};
VARCHAR VARCHAR VARCHAR
______________________________________ _______________________________________________ ______________________________
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin
....
Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person's FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same.
Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both.
If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT.
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable — nothing like the blow-up of duplicating properties across instances.
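A sketch of such a materialization, using the simple same-name rule from this post (the target graph IRI is made up; a real application rule would be stricter):
sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT INTO GRAPH <http://example.com/sameas>
  { ?a owl:sameAs ?b }
WHERE { ?a a foaf:Person ; foaf:name ?n .
        ?b a foaf:Person ; foaf:name ?n .
        FILTER ( ?a != ?b ) };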
The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others.
The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at.
The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty.
Enough talk. Time for an experiment. We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:
- Resolve identity at insert time. We remove duplicate person URIs and give the single remaining URI all the properties of all the duplicates; we expect those properties to be repeats most of the time. If a person references another person, we normalize the reference to point to the single URI of the referenced person.
- Give every duplicate URI of a person all the properties of all the duplicates. If the values are mostly the same, the data should not get much bigger, or so we think.
For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant.
The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse.
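As a quick illustration of the ID round trip (foaf:name's IRI is just an arbitrary example):
-- from the IRI text to the internal ID
SELECT IRI_TO_ID ('http://xmlns.com/foaf/0.1/name');
-- and back from the ID to the IRI text
SELECT ID_TO_IRI ( IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') );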
We count the triples that enter into the smoosh:
-- The name condition is an existence test; otherwise we would count each person
-- several times over, since the names occur in many graphs.
sparql
SELECT COUNT(*)
WHERE { { SELECT DISTINCT ?person
WHERE { ?person a foaf:Person }
} .
FILTER ( bif:exists ( SELECT (1)
WHERE { ?person foaf:name ?nn }
)
) .
?person ?p ?o
};
-- We get 3284674
We make a few tables for intermediate results.
-- For each distinct name, gather the properties and objects from
-- all subjects with this name
CREATE TABLE name_prop
( np_name ANY,
np_p IRI_ID_8,
np_o ANY,
PRIMARY KEY ( np_name,
np_p,
np_o
)
);
ALTER INDEX name_prop
ON name_prop
PARTITION ( np_name VARCHAR (-1, 0hexffff) );
-- Map from name to canonical IRI used for the name
CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY,
ni_s IRI_ID_8
);
ALTER INDEX name_iri
ON name_iri
PARTITION ( ni_name VARCHAR (-1, 0hexffff) );
-- Map from person IRI to canonical person IRI
CREATE TABLE pref_iri
( i IRI_ID_8,
pref IRI_ID_8,
PRIMARY KEY ( i )
);
ALTER INDEX pref_iri
ON pref_iri
PARTITION ( i INT (0hexffff00) );
-- a table for the materialization where all aliases get all properties of every other
CREATE TABLE smoosh_ct
( s IRI_ID_8,
p IRI_ID_8,
o ANY,
PRIMARY KEY ( s,
p,
o
)
);
ALTER INDEX smoosh_ct
ON smoosh_ct
PARTITION ( s INT (0hexffff00) );
-- disable transaction log and enable row auto-commit. This is necessary, otherwise
-- bulk operations are done transactionally and they will run out of rollback space.
LOG_ENABLE (2);
-- Gather all the properties of all persons with a name under that name.
-- INSERT SOFT means that duplicates are ignored
INSERT SOFT name_prop
SELECT "n", "p", "o"
FROM ( sparql
DEFINE output:valmode "LONG"
SELECT ?n ?p ?o
WHERE { ?x a foaf:Person .
?x foaf:name ?n .
?x ?p ?o
}
) xx ;
-- Now choose for each name the canonical IRI
INSERT INTO name_iri
SELECT np_name,
( SELECT MIN (s)
FROM rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name')
) AS mini
FROM name_prop
WHERE np_p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- For each person IRI, map to the canonical IRI of that person
INSERT SOFT pref_iri (i, pref)
SELECT s,
ni_s
FROM name_iri,
rdf_quad
WHERE o = ni_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- Make a graph where all persons have one iri with all the properties of all aliases
-- and where person-to-person refs are canonicalized
INSERT SOFT rdf_quad (g,s,p,o)
SELECT IRI_TO_ID ('psmoosh'),
ni_s,
np_p,
COALESCE ( ( SELECT pref
FROM pref_iri
WHERE i = np_o
),
np_o
)
FROM name_prop,
name_iri
WHERE ni_name = np_name
OPTION ( loop, quietcast ) ;
-- A little explanation: The properties of names are copied into rdf_quad with the name
-- replaced with its canonical IRI. If the object has a canonical IRI, this is used as
-- the object, else the object is unmodified. This is the COALESCE with the sub-query.
-- This takes a little time. To check on the progress, take another connection to the
-- server and do
STATUS ('cluster');
-- It will return something like
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i
-- buffers 549481 253929 d 8 w 0 pfs
-- Now finalize the state; this makes it permanent. Else the work will be lost on server
-- failure, since there was no transaction log
CL_EXEC ('checkpoint');
-- See what we got
sparql
SELECT COUNT (*)
FROM <psmoosh>
WHERE {?s ?p ?o};
-- This is 2253102
-- Now make the copy where all have the properties of all synonyms. This takes so much
-- space we do not insert it as RDF quads, but make a special table for it so that we can
-- run some statistics. This saves time.
INSERT SOFT smoosh_ct (s, p, o)
SELECT s, np_p, np_o
FROM name_prop,
rdf_quad
WHERE o = np_name
AND p = IRI_TO_ID ('http://xmlns.com/foaf/0.1/name') ;
-- as above, INSERT SOFT so as to ignore duplicates
SELECT COUNT (*)
FROM smoosh_ct;
-- This is 167360324
-- Find out where the bloat comes from
SELECT TOP 20 COUNT (*),
ID_TO_IRI (p)
FROM smoosh_ct
GROUP BY p
ORDER BY 1 DESC;
The results are:
54728777 http://www.w3.org/2002/07/owl#sameAs
48543153 http://xmlns.com/foaf/0.1/knows
13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512 http://xmlns.com/foaf/0.1/interest
11415867 http://xmlns.com/foaf/0.1/nick
6683963 http://xmlns.com/foaf/0.1/weblog
6650093 http://xmlns.com/foaf/0.1/depiction
4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629 http://xmlns.com/foaf/0.1/homepage
1776555 http://xmlns.com/foaf/0.1/holdsAccount
1219525 http://xmlns.com/foaf/0.1/based_near
305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965 http://xmlns.com/foaf/0.1/name
155131 http://xmlns.com/foaf/0.1/dateOfBirth
153001 http://xmlns.com/foaf/0.1/img
111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930 http://xmlns.com/foaf/0.1/gender
48517 http://www.w3.org/2004/02/skos/core#subject
45697 http://www.w3.org/2000/01/rdf-schema#label
44860 http://purl.org/vocab/bio/0.1/olb
Now compare with the predicate distribution of the smoosh with identities canonicalized:
sparql
SELECT COUNT (*) ?p
FROM <psmoosh>
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY 1 DESC
LIMIT 20;
Results are:
748311 http://xmlns.com/foaf/0.1/knows
548391 http://xmlns.com/foaf/0.1/interest
140531 http://www.w3.org/2000/01/rdf-schema#seeAlso
105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497 http://xmlns.com/foaf/0.1/name
48099 http://www.w3.org/2004/02/skos/core#subject
45179 http://xmlns.com/foaf/0.1/depiction
40229 http://www.w3.org/2000/01/rdf-schema#comment
38272 http://www.w3.org/2000/01/rdf-schema#label
37378 http://xmlns.com/foaf/0.1/nick
37186 http://dbpedia.org/property/abstract
34003 http://xmlns.com/foaf/0.1/img
26182 http://xmlns.com/foaf/0.1/homepage
23795 http://www.w3.org/2002/07/owl#sameAs
17651 http://xmlns.com/foaf/0.1/mbox_sha1sum
17430 http://xmlns.com/foaf/0.1/dateOfBirth
15586 http://xmlns.com/foaf/0.1/page
12869 http://dbpedia.org/property/reference
12497 http://xmlns.com/foaf/0.1/weblog
12329 http://blogs.yandex.ru/schema/foaf/school
If we drop the owl:sameAs triples from the count, the bloat is somewhat smaller, but the result is still tens of times larger than the canonicalized copy or the initial state.
Now, when we try using the psmoosh graph, we still get results that differ from those over the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. Such relations do exist:
sparql
SELECT COUNT (*)
WHERE { ?s foaf:knows ?thing .
FILTER ( !bif:exists ( SELECT (1)
WHERE { ?thing foaf:name ?nn }
)
)
};
-- 1393940
So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here.
Finally, we calculate the smoosh blow-up factors. We do not include owl:sameAs triples in the counts: the full smoosh thus counts 167360324 - 54728777 triples, and the canonicalized copy 2253102 - 23795 = 2229307.
select (167360324 - 54728777) / 3284674.0;
34.290022997716059
select 2229307 / 3284674.0;
= 0.678699621332284
So, to get a smoosh that still is not really equivalent to the original, multiply the original triple count by 34 if synonyms are not collapsed, or by 0.68 if they are.
Making the smooshes does not take very long: some minutes for the small one, and 33 minutes for filling the smoosh_ct table; inserting the big one as RDF quads would take longer, a couple of hours maybe. The runs were not optimally tuned, so the numbers serve only to show that smooshing takes time, probably more time than is allowable in an interactive situation, no matter how the process is optimized.
12/16/2008 14:14 GMT-0500 | Modified: 12/16/2008 15:01 GMT-0500
Virtuoso Vs. MySQL: Setting the Berlin Record Straight (update 2)
In the context of the Berlin SPARQL Benchmark, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the BSBM paper had 3 seconds for MySQL and 100 for Virtuoso, basically through ignoring cache effects.
So we decided to do it ourselves.
The score is (updated with a revised innodb_buffer_pool_size setting, based on the advice noted below):
n-clients    Virtuoso      MySQL (increased buffer pool size)    MySQL (default buffer pool size)
    1         41,161.33     27,023.11                             12,171.41
    4        127,918.30     (pending)                             37,566.82
    8        218,162.29    105,524.23                             51,104.39
   16        214,763.58     98,852.42                             47,589.18
The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here.
The benchmark is pure SQL, nothing to do with SPARQL or RDF.
The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux.
The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache; the test effectively runs from memory. We ran ANALYZE TABLE on each MySQL table but noticed that this had no effect. Virtuoso does its stats sampling on the go; possibly MySQL does too, since the explicit stats made no difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache query results in some cases, but this was not apparent in the tests.
The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine --
MySQL ought to do better. We suspect that here, just as in the TPC-D experiment we did a while back, the query plans are not quite right. Also, we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this; the public is invited to tell us about any such.
Update:
Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache.
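For anyone re-running this, a quick way to confirm what InnoDB actually got; the 8G below is our assumption, matching the disk cache figure above, and in MySQL 5.1 the value can only be set in my.cnf followed by a restart:
-- shows the running value, in bytes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- corresponding my.cnf fragment (assumed value):
--   [mysqld]
--   innodb_buffer_pool_size = 8G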
On the first run, we noticed that the InnoDB warm-up time was well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run the larger scales for half an hour first, then measure, and then measure again. If the second measurement is the same as the first, then it is good.
As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.
11/20/2008 11:06 GMT-0500 | Modified: 11/24/2008 10:15 GMT-0500
ISWC 2008: Some Questions
Inference: Is it always forward chaining?
We got a number of questions about Virtuoso's inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.
Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.
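To make the run-time option concrete, here is a hedged sketch. Assume a rule-set context has been declared with rdfs_rule_set() over an ontology graph stating, among other things, that foaf:Person is a subclass of foaf:Agent; the context name below is a placeholder. Subclass entailment and owl:sameAs can then be switched on per query:
sparql
DEFINE input:inference "http://example.com/rules/foaf#"
DEFINE input:same-as "yes"
SELECT COUNT (*)
WHERE { ?who a foaf:Agent };
Nothing is materialized; drop the two DEFINEs and the same query counts only the asserted foaf:Agent instances.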
For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.
We had a word with Ian Horrocks on the question. He noted that the community is often naive in equating the description of semantics with a description of the algorithm. The data need not always be blown up.
The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes, is retracted, or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, queries run some 20% faster if subclass and sub-property entailment is materialized rather than done at run time.
Genetic Algorithms for SPARQL?
Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, Stefan Schlobach, et al., for their paper on using genetic algorithms to guess how variables in a SPARQL query ought to be instantiated. Prisoners of our "conventional wisdom" as we are, this might never have occurred to us.
Schema Last?
It is interesting to see the industry come to the semantic web conferences talking about schema last, while at the same time the traditional semantic web people stress enforcing schema constraints and making logics that perform more predictably and are friendlier to databases. Thus do the extremes converge.
There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data: one can just load it and look at what is there. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient, as sketched below. However, it seems that a really high-traffic online application would always benefit from having some application-specific data structures. Such structures could also save considerably on hardware.
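As a trivial sketch of why such additions are cheap, attaching a property nobody planned for needs no schema change at all; the graph and property IRIs here are invented for the example:
sparql
INSERT INTO GRAPH <http://example.com/people>
  { <http://example.com/people#alice>
      <http://example.com/schema#preferredContactTime> "evenings" };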
The divide between RDF and an application-oriented relational representation is not sharp. We have the capabilities in our RDB-to-RDF mapping. We just need to show this and have SPARUL and data loading
11/04/2008 15:54 GMT-0500 | Modified: 11/04/2008 14:37 GMT-0500
ISWC 2008: Billion Triples Challenge
We showed our billion triples demo at the ISWC 2008 poster session. Generally people liked what they saw, as we basically did what one had always wanted to do with SPARQL but never could. This means, firstly, full SQL parity, with sub-queries, aggregation, full text, etc. Beyond SQL, we have transitive sub-queries, owl:sameAs at run time, and other inference features, all on demand.
The live demo is at http://b3s.openlinksw.com/. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole LOD cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter.
In the words of Jim Hendler, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this: if the event becomes like the TPC benchmarks, it will limit entrance to full-time database players. Anyway, we got a special mention in the intro of the challenge track.
The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the DARQ Matter of Federation post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see.
Why Not Just Join?
The entries from DERI and LarKC (MaRVIN, "Massive RDF Versatile Inference Network") did materialization of inference results in a cluster environment. What they were not doing was joining across partitions: the data was partitioned on some criterion, and the data in each partition was then further refined according to rules known to all partitions. DERI did not address joining further.
"Nature shall be the guide of the alchemist," goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for information is expensive and subject to misunderstanding; asking another research group might never produce an answer.
Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat latency point to point connectivity — some joins are fast; others are not even tried, for all we know.
A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time.
MaRVIN's approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things.
The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes.
Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids.
We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins.
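A hedged sketch of the database-query version, with an entirely invented ex: vocabulary and the dosage logic left out; the transitive sub-query walks the cosmetics taxonomy up from each product's class:
sparql
PREFIX ex: <http://example.com/schema#>
SELECT ?product ?poison
WHERE { { SELECT ?c ?super WHERE { ?c rdfs:subClassOf ?super } }
        OPTION ( TRANSITIVE, t_distinct, t_in (?c), t_out (?super) ) .
        FILTER ( ?super = ex:Cosmetic ) .
        ?product a ?c ;
                 ex:ingredient ?poison .
        ?poison a ex:ToxicSubstance
      };
The point is just that the taxonomy walk is one more join step in an ordinary query; the hazard-assessment business logic would add further joins in the same way.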
The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain knowledge is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one's perceptions: Again, it is like nature, sometimes not all cards are on the table.
It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question.
The question is, at what point will a cluster using distributed database operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a DISTINCT operation requires that data passes through a single place but JOINing and aggregation in general do not.
We will still publish numbers during this November.
11/04/2008 15:52 GMT-0500 | Modified: 11/04/2008 14:36 GMT-0500