Transaction Semantics in RDF and Relational Models
[ Orri Erling ]
As part of defining a benchmark audit for testing ACID properties of RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and their implications for locking (pessimistic) and multi-version (optimistic) concurrency control schemes.
In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues.
Insert (Create) and Delete
INSERT and DELETE as defined in SPARQL are queries that generate a result set which is then used to instantiate triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the
DELETE FROM table WHERE key IN
( SELECT key1 FROM other_table )
expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table.
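A minimal sketch of the same pattern in SPARQL 1.1 Update (the ex: prefix and property names are hypothetical, used only for illustration) shows how the WHERE clause can read one set of triples while the DELETE template removes another, so the deleted triples need not be among those read:
PREFIX ex: <http://example.org/>
# The WHERE pattern reads only the ex:listedIn triples; the triples actually
# deleted (?item ex:status ex:obsolete) are never read by the query part.
DELETE { ?item ex:status ex:obsolete }
WHERE  { ?item ex:listedIn ex:ObsoleteList }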
The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.
Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject 1 or would have the three just inserted.
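For concreteness, the two concurrent operations could be written as follows in SPARQL 1.1 Update, with the numeric identifiers of the example rendered as hypothetical ex: IRIs and plain literals; each operation is issued as its own request by its own transaction:
# Transaction 1 (its own request): delete every triple with subject 1
PREFIX ex: <http://example.org/>
DELETE WHERE { ex:s1 ?x ?y }

# Transaction 2 (concurrent, its own request): insert three triples with subject 1
PREFIX ex: <http://example.org/>
INSERT DATA { ex:s1 ex:p2 4 . ex:s1 ex:p2 3 . ex:s1 ex:p3 5 }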
Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE.
We would not get the desired end-state of either all the inserted triples or no triples with subject 1 if the read or the DELETE were not serializable.
Furthermore, if a DELETE template produces a triple that does not exist in the pre-image, the DELETE semantics still imply that this triple also does not exist in the after-image, which implies serializability.
Read and Update
Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is
a balance 10
b balance 10
We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9.
A relational database needs REPEATABLE READ isolation for this.
With RDF, txn1 reads that a has a balance of 10. At the same time, txn2 attempts to read the balance of a and waits, because the read of txn1 is exclusive. txn1 proceeds and reads the balance of b. It then updates the balances of a and b.
Everything proceeds without the deadlock that is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done, and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF's schema-less model.
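As a sketch, the delete-plus-insert that txn1 performs could be expressed as a single SPARQL 1.1 Update operation (the acct: prefix and balance property are hypothetical; the history record and error handling are omitted):
PREFIX acct: <http://example.org/accounts#>
# txn1: transfer 1 from a to b by replacing both balance triples
DELETE { acct:a acct:balance ?ab .
         acct:b acct:balance ?bb }
INSERT { acct:a acct:balance ?newA .
         acct:b acct:balance ?newB }
WHERE  { acct:a acct:balance ?ab .
         acct:b acct:balance ?bb .
         BIND (?ab - 1 AS ?newA)
         BIND (?bb + 1 AS ?newB) }
txn2 performs the analogous operation with the amounts reversed.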
We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in an RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could argue here that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.
Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }.
Locking Order and OLTP
To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood of contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, lock acquisition is not guaranteed to be serial, so deadlocks do not entirely go away, but they do become fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMSs usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system, and applying row-level locking to this does not give the semantics one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply to the data that a transaction reads and writes; they do not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
Pessimistic Vs. Optimistic Concurrency Control
We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.
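As a sketch (reusing the hypothetical acct: prefix from the account example above), this is the kind of read that serializability protects:
PREFIX acct: <http://example.org/accounts#>
# Run twice inside one SERIALIZABLE transaction, this count must not change;
# the lock held before the first matching row blocks concurrent inserts of
# new balance triples for acct:a.
SELECT (COUNT(*) AS ?n)
WHERE { acct:a acct:balance ?x }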
Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore a DBMS should, within the limits of the possible, optimize locking order in locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to hit the highest-cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply, but determining the cardinality of an updated set will have to rely on statistics of predicate cardinality. Such statistics are needed for query optimization anyway.
Eventual Consistency
Web-scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase commit becomes very inefficient as latency increases; thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed quads (i.e., DELETEd and INSERTed quads, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that should be single-valued for the application. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.
As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
Conclusions
We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.
Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
Posted: 03/22/2011 19:55 GMT | Modified: 03/22/2011 18:24 GMT
Benchmarks, Redux (part 1): On RDF Benchmarks
[ Orri Erling ]
This post introduces a series on RDF benchmarking. In these posts I will cover the following:
- Correct misleading information about us in the recent Berlin report: the load rate is off the wall, and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.
- Discuss configuration options for Virtuoso.
- Tell a story about multithreading and its perils, and how vectoring and scale-out can save us.
- Analyze the run-time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.
- Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.
- Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Should we do things a la TPC or a la TREC? We will hopefully try a bit of both; at least so I have proposed to our partners in LOD2, the EU FP7 project that also funded the recent Berlin report.
- Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload: the Social Intelligence Benchmark.
- Talk about BSBM specifically. What does it measure?
- Discuss some experiments with the BI use case of BSBM.
- Document how the results mentioned here were obtained, and suggest practices for benchmark running and disclosure.
The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMS.
Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of stuff for benchmarking, when at least in some cases licenses prohibit unauthorized publishing of benchmark results, might be seen to conflict with the spirit of the license if not its letter. We will see.
For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor's permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.
In the LOD2 proposal we also say in so many words that we will stretch the limits of the state of the art. This stretching is surely not limited to the project's own products but should also include the general benchmarking aspect. I will say with confidence that running single-server benchmarks at a maximum of 200 Mtriples of data is not stretching anything.
So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single-server scales. Then we could go, for example, to Giovanni's cluster at DERI and do 10 and 20 billion triples; this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS, who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one.
So I asked everybody about config instructions, which produced a certain amount of dismay, as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative, though, since Ontotext and Garlik provided some information. We will look into these this week and next. We will not publish any information without asking first.
In this series of posts I will only talk about OpenLink Software.
Benchmarks, Redux Series
- Benchmarks, Redux (part 1): On RDF Benchmarks (this post)
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
Posted: 02/28/2011 15:20 GMT | Modified: 03/14/2011 17:15 GMT
SPARQL Guide for the Perl Developer
[ Kingsley Uyi Idehen ]
What?
A simple guide usable by any Perl developer seeking to exploit SPARQL without hassles.
Why?
SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?
SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing.
Steps:
- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: http://localhost:8890/sparql).
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
Script:
#
# Demonstrating use of a single query to populate a
# Virtuoso Quad Store via Perl.
#
#
# HTTP URL is constructed accordingly with CSV query results format as the default via mime type.
#
use CGI qw/:standard/;
use LWP::UserAgent;
use Data::Dumper;
use Text::CSV_XS;
sub sparqlQuery {
my $query=shift;
my $baseURL=shift;
my $format=shift;
%params=(
"default-graph" => "", "should-sponge" => "soft", "query" => $query,
"debug" => "on", "timeout" => "", "format" => $format,
"save" => "display", "fname" => ""
);
@fragments=();
foreach $k (keys %params) {
$fragment="$k=".CGI::escape($params{$k});
push(@fragments,$fragment);
}
$query=join("&", @fragments);
$sparqlURL="${baseURL}?$query";
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");
my $req = HTTP::Request->new(GET => $sparqlURL);
my $res = $ua->request($req);
$str=$res->content;
$csv = Text::CSV_XS->new();
foreach $line ( split(/^/, $str) ) {
$csv->parse($line);
@bits=$csv->fields();
push(@rows, [ @bits ] );
}
return \@rows;
}
# Setting Data Source Name (DSN)
$dsn="http://dbpedia.org/resource/DBpedia";
# Virtuoso pragmas for instructing SPARQL engine to perform an HTTP GET using the IRI in
# FROM clause as Data Source URL en route to DBMS
# record Inserts.
$query="DEFINE get:soft \"replace\"\n
# Generic (non-Virtuoso-specific) SPARQL
# Note: this will not add records to the
# DBMS
SELECT DISTINCT * FROM <$dsn> WHERE {?s ?p ?o}";
$data=sparqlQuery($query, "http://localhost:8890/sparql/", "text/csv");
print "Retrieved data:\n";
print Dumper($data);
Output
Retrieved data:
$VAR1 = [
[
's',
'p',
'o'
],
[
'http://dbpedia.org/resource/DBpedia',
'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
'http://www.w3.org/2002/07/owl#Thing'
],
[
'http://dbpedia.org/resource/DBpedia',
'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
'http://dbpedia.org/ontology/Work'
],
[
'http://dbpedia.org/resource/DBpedia',
'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
'http://dbpedia.org/class/yago/Software106566077'
],
...
Conclusion
CSV was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a Perl developer that already knows how to use Perl for HTTP based data access within HTML. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Posted: 01/25/2011 11:05 GMT | Modified: 01/26/2011 18:11 GMT
SPARQL Guide for the Javascript Developer
[ Kingsley Uyi Idehen ]
What?
A simple guide usable by any Javascript developer seeking to exploit SPARQL without hassles.
Why?
SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?
SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing.
Steps:
- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: http://localhost:8890/sparql).
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
Script:
/*
Demonstrating use of a single query to populate a Virtuoso Quad Store via Javascript.
*/
/*
HTTP URL is constructed accordingly with JSON query results format as the default via mime type.
*/
function sparqlQuery(query, baseURL, format) {
if(!format)
format="application/json";
var params={
"default-graph": "", "should-sponge": "soft", "query": query,
"debug": "on", "timeout": "", "format": format,
"save": "display", "fname": ""
};
var querypart="";
for(var k in params) {
querypart+=k+"="+encodeURIComponent(params[k])+"&";
}
var queryURL=baseURL + '?' + querypart;
var xmlhttp;
if (window.XMLHttpRequest) {
xmlhttp=new XMLHttpRequest();
}
else {
xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.open("GET",queryURL,false);
xmlhttp.send();
return JSON.parse(xmlhttp.responseText);
}
/*
setting Data Source Name (DSN)
*/
var dsn="http://dbpedia.org/resource/DBpedia";
/*
The Virtuoso pragma DEFINE get:soft "replace" instructs the Virtuoso SPARQL engine to perform an HTTP GET using the IRI in the FROM clause as the Data Source URL for DBMS record inserts.
*/
var query="DEFINE get:soft \"replace\"\nSELECT DISTINCT * FROM <"+dsn+"> WHERE {?s ?p ?o}";
var data=sparqlQuery(query, "/sparql/");
Output
Place the snippet above into a <script> section of an HTML document to see the query result.
Conclusion
JSON was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a Javascript developer that already knows how to use Javascript for HTTP based data access within HTML. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Posted: 01/21/2011 14:59 GMT | Modified: 01/26/2011 18:10 GMT
SPARQL Guide for the PHP Developer
[ Kingsley Uyi Idehen ]
What?
A simple guide usable by any PHP developer seeking to exploit SPARQL without hassles.
Why?
SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?
SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing, e.g., local object binding in PHP.
Steps:
- From your command line execute: aptitude search '^PHP26', to verify PHP is in place
- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: http://localhost:8890/sparql).
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
Script:
#!/usr/bin/env php
<?php
#
# Demonstrating use of a single query to populate a Virtuoso Quad Store via PHP.
#
# HTTP URL is constructed accordingly with JSON query results format in mind.
function sparqlQuery($query, $baseURL, $format="application/json")
{
$params=array(
"default-graph" => "",
"should-sponge" => "soft",
"query" => $query,
"debug" => "on",
"timeout" => "",
"format" => $format,
"save" => "display",
"fname" => ""
);
$querypart="?";
foreach($params as $name => $value)
{
$querypart=$querypart . $name . '=' . urlencode($value) . "&";
}
$sparqlURL=$baseURL . $querypart;
return json_decode(file_get_contents($sparqlURL));
};
# Setting Data Source Name (DSN)
$dsn="http://dbpedia.org/resource/DBpedia";
#Virtuoso pragmas for instructing SPARQL engine to perform an HTTP GET
#using the IRI in FROM clause as Data Source URL
$query="DEFINE get:soft \"replace\"
SELECT DISTINCT * FROM <$dsn> WHERE {?s ?p ?o}";
$data=sparqlQuery($query, "http://localhost:8890/sparql/");
print "Retrieved data:\n" . json_encode($data);
?>
Output
Retrieved data:
{"head":
{"link":[],"vars":["s","p","o"]},
"results":
{"distinct":false,"ordered":true,
"bindings":[
{"s":
{"type":"uri","value":"http:\/\/dbpedia.org\/resource\/DBpedia"},"p":
{"type":"uri","value":"http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#type"},"o":
{"type":"uri","value":"http:\/\/www.w3.org\/2002\/07\/owl#Thing"}},
{"s":
{"type":"uri","value":"http:\/\/dbpedia.org\/resource\/DBpedia"},"p":
{"type":"uri","value":"http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#type"},"o":
{"type":"uri","value":"http:\/\/dbpedia.org\/ontology\/Work"}},
{"s":
{"type":"uri","value":"http:\/\/dbpedia.org\/resource\/DBpedia"},"p":
{"type":"uri","value":"http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#type"},"o":
{"type":"uri","value":"http:\/\/dbpedia.org\/class\/yago\/Software106566077"}},
...
Conclusion
JSON was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a PHP developer that already knows how to use PHP for HTTP based data access. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Posted: 01/20/2011 16:25 GMT | Modified: 01/25/2011 10:36 GMT