RDF and Transactions
[ Orri Erling ]
I will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists.
Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?
There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.
At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.
If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.
There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).
If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.
There is of course the OWL side, where consistency is important but is defined in such complex ways that it, again, is not a good fit for an RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.
Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
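As a minimal sketch of that reservation pattern (the table, columns, and values are hypothetical, and the exact isolation and transaction statements vary by DBMS and client), the check and the reservation go into one transaction so that two concurrent buyers cannot both see the seat as free:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT status FROM seats WHERE flight = 'VA117' AND seat = '12A';   -- check availability

UPDATE seats SET status = 'reserved', holder = 'alice'
 WHERE flight = 'VA117' AND seat = '12A' AND status = 'free';       -- reserve only if still free

COMMIT;

-- If a concurrent transaction has touched the same rows, the server may abort this
-- one with a deadlock (SQL State 40001); the client is then expected to retry.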
Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.
Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.
As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.
As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.
This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.
As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.
With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.
If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, as long as each is under a certain size (normally 10,000 quads), is guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there was always a possibility of an INSERT or DELETE failing because of a deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock-free, since there the primary transaction is the INSERT or DELETE of a small graph.
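For illustration, a typical trickle-load transaction of this kind is just one small graph inserted in a single request (the graph name and triples below are made up); per the above, such an operation stays deadlock-free in Virtuoso 7 as long as it remains under the size threshold. The same update can be posted to a SPARQL end-point or issued through a connected client:

INSERT DATA
  { GRAPH <http://example.com/crawl/2011-03-22>
    { <http://example.com/doc/42>
        <http://purl.org/dc/terms/title>  "A freshly crawled page" ;
        <http://purl.org/dc/terms/source> <http://example.com/> .
    }
  }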
Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.
In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.
03/22/2011 18:52 GMT | Modified: 03/22/2011 17:44 GMT
Fault Tolerance in Virtuoso Cluster Edition (Short Version)
[ Orri Erling ]
For some time now, the Commercial Edition of Virtuoso has offered the option of storing data in a cluster in multiple copies. (This feature is not in, and is not planned to be added to, the Open Source Edition.)
Based on some feedback from the field, we decided to make this feature more user-friendly. The gist of the matter is that failure and recovery processes have been automated, so that neither application developers nor operations personnel need any knowledge of how things actually work.
So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.
Three types of individuals need to know about fault tolerance:
- Executives: What does it cost? Will it really eliminate downtime?
- System Administrators: Is it hard to configure? What do I do when I get an alert?
- Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?
I will explain the matter to each of these three groups:
Executives
The value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money's worth of read throughput and half the money's worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity.
Server instances are grouped in "quorums" of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.
The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please contact us for specifics.
Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.
System Administrators
To configure the cluster, divide your server instances into quorums of 2 or 3 whose members will mirror each other, placing each quorum member on a different host from the others in its quorum. These assignments are declared in a configuration file. Table definitions do not have to be altered for fault tolerance; it is enough for tables and indices to specify partitioning. To cover switch failures, use two switches and two NICs per machine, and connect one of each server's network cables to each switch.
When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.
Having mirrored disks in individual hosts is optional, since the data is in any case stored in two copies. Mirrored disks do shorten the window of vulnerability during which a partition runs on a single server instance, since they largely eliminate the need to copy potentially hundreds of GB of database files when recovering a failed instance.
Application Developers/Programmers
An application can connect to any server instance in the cluster and have access to the same data, with full ACID properties.
There are two types of errors that can occur in any database application: the database server instance may be offline or otherwise unreachable, or a transaction may be aborted due to a deadlock.
For the missing server instance, the application should try to reconnect. An ODBC/JDBC connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.
For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (SQL State 40001) as best practices dictate, there is no change needed.
Conclusion
In summary...
- Limited extra cost for fault tolerance; no equipment sitting idle.
- Easy operation: Replace servers when they fail; the cluster does the rest.
- No changes needed to most applications.
- No proprietary SQL APIs or special fault tolerance logic needed in applications.
- Fully transactional programming model.
All the above applies to both the Graph Model (RDF) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please contact OpenLink Software Sales for details of availability or for getting advance evaluation copies.
- Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.
- Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.
- Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
- Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
- Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).
04/07/2010 12:40 GMT | Modified: 04/14/2010 19:12 GMT
Re-introducing the Virtuoso Virtual Database Engine
[ Kingsley Uyi Idehen ]
In recent times a lot of the commentary and focus re. Virtuoso has centered on the RDF Quad Store and Linked Data. What sometimes gets overlooked is the sophisticated Virtual Database Engine that provides the foundation for all of Virtuoso's data integration capabilities.
In this post I provide a brief re-introduction to this essential aspect of Virtuoso.
What is it?
This component of Virtuoso is known as the Virtual Database Engine (VDBMS). It provides transparent high-performance and secure access to disparate data sources that are external to Virtuoso. It enables federated access and integration of data hosted by any ODBC- or JDBC-accessible RDBMS, RDF Store, XML database, or Document (Free Text)-oriented Content Management System. In addition, it facilitates integration with Web Services (SOAP-based SOA RPCs or REST-fully accessible Web Resources).
Why is it important?
In the most basic sense, you shouldn't need to upgrade your existing database engine version simply because your current DBMS and Data Access Driver combo isn't compatible with ODBC-compliant desktop tools such as Microsoft Access, Crystal Reports, BusinessObjects, Impromptu, or other ODBC-, JDBC-, ADO.NET-, or OLE DB-compliant applications. Simply place Virtuoso in front of your so-called "legacy database," and let it deliver the compliance levels sought by these tools.
In addition, it's important to note that today's enterprise, through application evolution, company mergers, or acquisitions, is often faced with disparately-structured data residing in any number of line-of-business-oriented data silos. Compounding the problem is the exponential growth of user-generated data via new social media-oriented collaboration tools and platforms. For companies to cost-effectively harness the opportunities accorded by the increasing intersection between line-of-business applications and social media, virtualization of data silos must be achieved, and this virtualization must be delivered in a manner that doesn't prohibitively compromise performance or completely undermine security at either the enterprise or personal level. Again, this is what you get by simply installing Virtuoso.
How do I use it?
The VDBMS may be used in a variety of ways, depending on the data access and integration task at hand. Examples include:
Relational Database Federation
You can make a single ODBC, JDBC, ADO.NET, OLE DB, or XMLA connection to multiple ODBC- or JDBC-accessible RDBMS data sources, concurrently, with the ability to perform intelligent distributed joins against externally-hosted database tables. For instance, you can join internal human resources data against internal sales and external stock market data, even when the HR team uses Oracle, the Sales team uses Informix, and the Stock Market figures come from Ingres!
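Here is an informal sketch of what such a distributed join can look like from the Virtuoso side, assuming the Oracle, Informix, and Ingres tables have already been attached into the local schema under the hypothetical names used below; Virtuoso's optimizer decides which sub-queries to push to each remote server, and to the query author all three tables look local:

SELECT e.name, SUM (s.amount) AS total_sales, q.close_price
  FROM ora_hr_employees e                          -- attached from Oracle
  JOIN inf_sales_orders s ON s.rep_id = e.emp_id   -- attached from Informix
  JOIN ing_market_quotes q ON q.ticker = s.ticker  -- attached from Ingres
 WHERE s.order_date >= '2010-01-01'
 GROUP BY e.name, q.close_price;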
Conceptual Level Data Access using the RDF Model
You can construct RDF Model-based Conceptual Views atop Relational Data Sources. This is about generating HTTP-based Entity-Attribute-Value (E-A-V) graphs using data culled "on the fly" from native or external data sources (Relational Tables/Views, XML-based Web Services, or User Defined Types).
You can also derive RDF Model-based Conceptual Views from Web Resource transformations "on the fly" -- the Virtuoso Sponger (RDFizing middleware component) enables you to generate RDF Model Linked Data via a RESTful Web Service or within the process pipeline of the SPARQL query engine (i.e., you simply use the URL of a Web Resource in the FROM clause of a SPARQL query).
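For example (the URL is hypothetical, and depending on server configuration a Sponger pragma or endpoint setting may also be needed), a SPARQL query can simply name an ordinary Web resource in its FROM clause and get back the entity-attribute-value graph distilled from it on the fly:

SELECT ?s ?p ?o
  FROM <http://www.example.com/some-web-page>
 WHERE { ?s ?p ?o }
 LIMIT 25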
It's important to note that Views take the form of HTTP links that serve as both Data Source Names and Data Source Addresses. This enables you to query and explore relationships across entities (i.e., People, Places, and other Real World Things) via HTTP clients (e.g., Web Browsers) or directly via SPARQL Query Language constructs transmitted over HTTP.
Conceptual Level Data Access using ADO.NET Entity Frameworks
As an alternative to RDF, Virtuoso can expose ADO.NET Entity Frameworks-based Conceptual Views over Relational Data Sources. It achieves this by generating Entity Relationship graphs via its native ADO.NET Provider, exposing all externally attached ODBC- and JDBC-accessible data sources. In addition, the ADO.NET Provider supports direct access to Virtuoso's native RDF database engine, eliminating the need for resource intensive Entity Frameworks model transformations.
02/17/2010 16:38 GMT | Modified: 02/17/2010 16:46 GMT
Compare & Contrast: SQL Server's Linked Server vs Virtuoso's Virtual Database Layer
[ Virtuso Data Space Bot ]
SQL Server's Linked Server feature promises the ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end.
The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them work with all data access providers, and even for those that do, results received via different methods may differ.
Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as "ad-hoc distributed queries".
Common tools that are typically used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them.
The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end's abilities, and using the Transact-SQL dialect for implementation-specific query syntax, regardless of the back-end's dialect. This leads to problems especially when the Linked Server is an older variant which doesn't support SQL-92 (e.g., Progress 8.x or earlier, Informix 7 or earlier), or whose SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, MySQL, etc.).
Basic Four-Part Naming
SELECT * FROM linked_server.[catalog].[schema].object
Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides what if any sub- or partial-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.
OpenQuery
SELECT * FROM OPENQUERY ( linked_server , 'query' )
OpenQuery also presumes that you have pre-defined a Linked Server, but executes the query as a "pass-through", handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them.
From the product docs:
SQL Server's Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. OPENQUERY can be referenced in the FROM clause of a query as if it were a table name. OPENQUERY can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, OPENQUERY returns only the first one.
...
OPENQUERY does not accept variables for its arguments. OPENQUERY cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name.
OpenRowset
SELECT *
FROM OPENROWSET
( 'provider_name' , 'datasource' ; 'user_id' ; 'password', { [ catalog. ] [ schema. ] object | 'query' } )
OpenRowset does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both "pass-through" and "local execution" queries, which can lead to confusion when the results differ (as they regularly will).
More from product docs:
Includes all connection information that is required to access remote data from an OLE DB data source. This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The OPENROWSET function can be referenced in the FROM clause of a query as if it were a table name. The OPENROWSET function can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, OPENROWSET returns only the first one.
OPENROWSET also supports bulk operations through a built-in BULK provider that enables data from a file to be read and returned as a rowset.
...
OPENROWSET can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation.
Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form schema.object must be specified. If the provider supports only catalog names, a three-part name of the form catalog.schema.object must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. For more information, see Transact-SQL Syntax Conventions (Transact-SQL). OPENROWSET does not accept variables for its arguments.
OpenDataSource
SELECT * FROM OPENDATASOURCE ( 'provider_name', 'provider_specific_datasource_specification' ).[catalog].[schema].object
As with basic four-part naming, OpenDataSource executes the query on SQL Server. SQL Server decides what if any sub-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.
Additional doc excerpts
Provides ad hoc connection information as part of a four-part object name without using a linked server name.
...
OPENDATASOURCE can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.
The OPENDATASOURCE function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, OPENDATASOURCE can be used as the first part of a four-part name that refers to a table or view name in a SELECT, INSERT, UPDATE, or DELETE statement, or to a remote stored procedure in an EXECUTE statement. When executing remote stored procedures, OPENDATASOURCE should refer to another instance of SQL Server. OPENDATASOURCE does not accept variables for its arguments.
Like the OPENROWSET function, OPENDATASOURCE should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither OPENDATASOURCE nor OPENROWSET provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. All connection information, including passwords, must be provided every time that OPENDATASOURCE is called.
Virtuoso's Virtual Database Layer, by contrast, can link objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.
There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.
All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.
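As an informal illustration (the DSN, credentials, and table name are hypothetical; see the Virtuoso documentation for the exact syntax and options), linking a remote object is a one-time declaration, after which it behaves like any other table in the local schema:

ATTACH TABLE "HR"."EMPLOYEES" FROM 'Oracle_HR_DSN' USER 'scott' PASSWORD 'tiger';

-- Once attached, the object is listed in Virtuoso's local schema and can be used
-- in any single-site or distributed query like a native table.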
02/12/2010 16:44 GMT | Modified: 02/17/2010 11:21 GMT
Compare & Contrast: Oracle Heterogeneous Services (HSODBC, DG4ODBC) vs Virtuoso's Virtual Database Layer
[ Virtuso Data Space Bot ]
Oracle Gateway Promise
Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue SQL queries against any ODBC- or OLE-DB-accessible linked back end.
Reality
The promise fails to materialize for several reasons. Immediate limitations include:
- All tables locked by a FOR UPDATE clause and all tables with LONG columns selected by the query must be located in the same external database.
- Distributed queries cannot select user-defined types or object REF datatypes on remote tables.
In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations:
- A table including a BLOB column must have a separate column that serves as a primary key.
- BLOB and CLOB data cannot be read by passthrough queries.
- Updates or deletes that include unsupported functions within a WHERE clause are not allowed.
- Generic Connectivity does not support stored procedures.
- Generic Connectivity agents cannot participate in distributed transactions; they support single-site transactions only.
- Generic Connectivity does not support multithreaded agents.
- Updating LONG columns with bind variables is not supported.
- Generic Connectivity does not support ROWIDs.
Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute force. Rather than interrogating the data access provider (whether ODBC or OLE DB) or the DBMS to which they are connected to learn their capabilities, they do many things using the lowest common denominator of functionality.
For instance, when a SELECT COUNT (*) FROM table@link is issued through Oracle SQL, the target DBMS is not simply asked to perform a SELECT COUNT (*) FROM table. Rather, the gateway issues a SELECT * FROM table to inventory all columns in the table, then executes and fully retrieves a SELECT field FROM table into an internal temporary table, where Oracle does the COUNT (*) itself, locally. Testing has confirmed this behavior, despite Oracle documentation stating that target data sources must support COUNT (*) (among other functions).
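To illustrate the contrast just described (the table and link names are hypothetical, and the rewritten statements only approximate what the agent emits):

-- what the application issues through Oracle SQL:
SELECT COUNT (*) FROM emp@odbc_link;

-- roughly what the generic connectivity agent runs against the target instead:
SELECT * FROM emp;         -- used only to inventory the table's columns
SELECT empno FROM emp;     -- fully retrieved into a local temporary table,
                           -- where Oracle then performs the COUNT (*) itself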
The Virtuoso Universal Server will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.
There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.
All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.
02/12/2010 16:43 GMT | Modified: 02/17/2010 11:21 GMT
5 Very Important Things to Note about HTTP based Linked Data
[ Kingsley Uyi Idehen ]
- It isn't World Wide Web Specific (HTTP != World Wide Web)
- It isn't Open Data Specific
- It isn't about "Free" (Beer or Speech)
- It isn't about Markup (so don't expect to grok it via "markup first" approach)
- It's about Hyperdata - the use of HTTP and REST to deliver a powerful platform agnostic mechanism for Data Reference, Access, and Integration.
When trying to understand HTTP based Linked Data, especially if you're well versed in DBMS technology use (User, Power User, Architect, Analyst, DBA, or Programmer) think:
- Open Database Connectivity (ODBC) without operating system, data model, or wire-protocol specificity or lock-in potential
- Java Database Connectivity (JDBC) without programming language specificity
- ADO.NET without .NET runtime specificity and .NET bound language specificity
- OLE-DB without Windows operating system & programming language specificity
- XMLA without XML format specificity - with Tabular and Multidimensional results formats expressible in a variety of data representation formats.
- All of the above scoped to the Record rather than Container level, with Generic HTTP scheme URIs associated with each Record, Field, and Field value (optionally)
Remember, the need for Data Access & Integration technology is the by-product of the following realities:
- Human-curated data is ultimately dirty, because our thick thumbs, inattention, distractions, and general discomfort with typing make typos prevalent;
- database engines exist for a variety of data models - Graph, Relational, Hierarchical;
- within databases you have different record container/partition names, e.g., Table Names;
- within a database record container you have records that are really aspects of the same thing (different keys exist in a plethora of operational / line-of-business systems that expose aspects of the same entity, e.g., customer data that spans Accounts, CRM, and ERP application databases);
- different field names exist for the same record (one database has "EMP" while another has "Employee");
- units of measurement are driven by locale - the UK office wants to see sales in Pounds Sterling while the French office prefers Euros, etc.;
- all of the above is subject to context halos which can be quite granular re. sensitivity, e.g., staff travel between locations, which alters locales and roles; basically, profiles matter a lot.
01/31/2010 17:31 GMT | Modified: 02/01/2010 09:00 GMT
Short Recap of Virtuoso Basics (#3 of 5)
[ Orri Erling ]
(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.
Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.
If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application-specific data structures as RDF applies equally to relational data in non-Virtuoso databases, because Virtuoso SQL can federate tables from heterogeneous DBMS.
On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.
Installation is simple, just one exe and one config file. There is a huge amount of code in installers — application code and test suites and such — but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.
Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.
To condense further:
- Scalable Delivery of Linked Data
- SPARQL and SQL
- Arbitrary RDF Data + Relational
- Also From 3rd Party RDBMS
- Easy Deployment
- Standard Interfaces
04/30/2009 11:49 GMT | Modified: 04/30/2009 12:11 GMT