Re-introducing the Virtuoso Virtual Database Engine
[ Kingsley Uyi Idehen ]
In recent times a lot of the commentary and focus re. Virtuoso has centered on the RDF Quad Store and Linked Data. What sometimes gets overlooked is the sophisticated Virtual Database Engine that provides the foundation for all of Virtuoso's data integration capabilities.
In this post I provide a brief re-introduction to this essential aspect of Virtuoso.
What is it?
This component of Virtuoso is known as the Virtual Database Engine (VDBMS). It provides transparent high-performance and secure access to disparate data sources that are external to Virtuoso. It enables federated access and integration of data hosted by any ODBC- or JDBC-accessible RDBMS, RDF Store, XML database, or Document (Free Text)-oriented Content Management System. In addition, it facilitates integration with Web Services (SOAP-based SOA RPCs or REST-fully accessible Web Resources).
Why is it important?
In the most basic sense, you shouldn't need to upgrade your existing database engine version simply because your current DBMS and Data Access Driver combo isn't compatible with ODBC-compliant desktop tools such as Microsoft Access, Crystal Reports, BusinessObjects, Impromptu, or other ODBC-, JDBC-, ADO.NET-, or OLE DB-compliant applications. Simply place Virtuoso in front of your so-called "legacy database," and let it deliver the compliance levels sought by these tools.
In addition, it's important to note that today's enterprise, through application evolution, company mergers, or acquisitions, is often faced with disparately-structured data residing in any number of line-of-business-oriented data silos. Compounding the problem is the exponential growth of user-generated data via new social media-oriented collaboration tools and platforms. For companies to cost-effectively harness the opportunities accorded by the increasing intersection between line-of-business applications and social media, virtualization of data silos must be achieved, and this virtualization must be delivered in a manner that doesn't prohibitively compromise performance or completely undermine security at either the enterprise or personal level. Again, this is what you get by simply installing Virtuoso.
How do I use it?
The VDBMS may be used in a variety of ways, depending on the data access and integration task at hand. Examples include:
Relational Database Federation
You can make a single ODBC, JDBC, ADO.NET, OLE DB, or XMLA connection to multiple ODBC- or JDBC-accessible RDBMS data sources, concurrently, with the ability to perform intelligent distributed joins against externally-hosted database tables. For instance, you can join internal human resources data against internal sales and external stock market data, even when the HR team uses Oracle, the Sales team uses Informix, and the Stock Market figures come from Ingres!
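To make this concrete, here is a minimal sketch of such a distributed join, assuming the remote Oracle, Informix, and Ingres tables have already been linked into Virtuoso's local schema; all schema, table, and column names below are hypothetical:

SELECT e.full_name, s.total_sales, q.last_price
  FROM ORA_HR.employees e          -- linked from the Oracle HR database
  JOIN INF_SALES.sales_summary s   -- linked from the Informix Sales database
    ON s.employee_id = e.employee_id
  JOIN ING_MKT.stock_quotes q      -- linked from the Ingres stock-market feed
    ON q.ticker = s.ticker;

Virtuoso plans the join itself, pushing down to each remote engine the portions it can evaluate.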
Conceptual Level Data Access using the RDF Model
You can construct RDF Model-based Conceptual Views atop Relational Data Sources. This is about generating HTTP-based Entity-Attribute-Value (E-A-V) graphs using data culled "on the fly" from native or external data sources (Relational Tables/Views, XML-based Web Services, or User Defined Types).
You can also derive RDF Model-based Conceptual Views from Web Resource transformations "on the fly" -- the Virtuoso Sponger (RDFizing middleware component) enables you to generate RDF Model Linked Data via a RESTful Web Service or within the process pipeline of the SPARQL query engine (i.e., you simply use the URL of a Web Resource in the FROM clause of a SPARQL query).
It's important to note that these Views take the form of HTTP links that serve as both Data Source Names and Data Source Addresses. This enables you to query and explore relationships across entities (i.e., People, Places, and other Real World Things) via HTTP clients (e.g., Web Browsers) or directly via SPARQL Query Language constructs transmitted over HTTP.
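As a hedged sketch of the second route, the query below uses a hypothetical Web Resource URL in the FROM clause; the Sponger fetches and RDFizes the resource within the SPARQL engine's pipeline. The get:soft pragma is assumed here as the usual way to tell Virtuoso to fetch the remote resource; the query can be sent to a Virtuoso SPARQL endpoint, or issued through the SQL channel by prefixing it with the SPARQL keyword, as shown:

SPARQL
DEFINE get:soft "soft"
SELECT ?property ?value
FROM <http://www.example.com/some-web-resource>
WHERE { ?s ?property ?value }
LIMIT 25 ;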
Conceptual Level Data Access using ADO.NET Entity Frameworks
As an alternative to RDF, Virtuoso can expose ADO.NET Entity Frameworks-based Conceptual Views over Relational Data Sources. It achieves this by generating Entity Relationship graphs via its native ADO.NET Provider, exposing all externally attached ODBC- and JDBC-accessible data sources. In addition, the ADO.NET Provider supports direct access to Virtuoso's native RDF database engine, eliminating the need for resource intensive Entity Frameworks model transformations.
02/17/2010 16:38 GMT | Modified: 02/17/2010 16:46 GMT
Compare & Contrast: SQL Server's Linked Server vs Virtuoso's Virtual Database Layer
[ Virtuoso Data Space Bot ]
SQL Server Linked Server Promise
The ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end -- via Linked Servers.
Reality
The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them works with all data access providers, and even for those that do, results received via different methods may differ.
Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as "ad-hoc distributed queries".
Tools commonly used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them.
The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end's abilities, and using the Transact-SQL dialect for implementation-specific query syntax, regardless of the back-end's dialect. This leads to problems, especially when the Linked Server is an older variant which doesn't support SQL-92 (e.g., Progress 8.x or earlier, Informix 7 or earlier), or whose SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, MySQL, etc.).
Basic Four-Part Naming
SELECT * FROM linked_server.[catalog].[schema].object
Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides what, if any, sub- or partial queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.
OpenQuery
SELECT * FROM OPENQUERY ( linked_server , 'query' )
OpenQuery also presumes that you have pre-defined a Linked Server, but executes the query as a "pass-through", handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them.
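For example (the linked server name and remote table are hypothetical), a pass-through query can exploit a remote-dialect feature -- here, Informix's FIRST n row limiter -- that SQL Server's own query rewriter would never emit:

SELECT * FROM OPENQUERY ( informix_link , 'SELECT FIRST 10 * FROM customer' )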
From the product docs:
SQL Server's Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. OPENQUERY can be referenced in the FROM clause of a query as if it were a table name. OPENQUERY can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, OPENQUERY returns only the first one.
...
OPENQUERY does not accept variables for its arguments. OPENQUERY cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name.
OpenRowset
SELECT *
FROM OPENROWSET
( 'provider_name' , 'datasource' ; 'user_id' ; 'password', { [ catalog. ] [ schema. ] object | 'query' } )
OpenRowset does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both "pass-through" and "local execution" queries, which can lead to confusion when the results differ (as they regularly will).
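A hedged example using the generic OLE DB Provider for ODBC (MSDASQL); the DSN, credentials, and remote query are all hypothetical:

SELECT *
FROM OPENROWSET ( 'MSDASQL' , 'DSN=MyOracle' ; 'scott' ; 'tiger' ,
                  'SELECT empno, ename FROM emp' )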
More from product docs:
Includes all connection information that is required to access remote data from an OLE DB data source. This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The OPENROWSET function can be referenced in the FROM clause of a query as if it were a table name. The OPENROWSET function can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, OPENROWSET returns only the first one.
OPENROWSET also supports bulk operations through a built-in BULK provider that enables data from a file to be read and returned as a rowset.
...
OPENROWSET can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation.
Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form schema.object must be specified. If the provider supports only catalog names, a three-part name of the form catalog.schema.object must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. For more information, see Transact-SQL Syntax Conventions (Transact-SQL). OPENROWSET does not accept variables for its arguments.
OpenDataSource
SELECT * FROM OPENDATASOURCE ( 'provider_name', 'provider_specific_datasource_specification' ).[catalog].[schema].object
As with basic four-part naming, OpenDataSource executes the query on SQL Server. SQL Server decides what, if any, sub-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.
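For instance (server, database, and table names are hypothetical), using the SQL Server Native Client provider:

SELECT *
FROM OPENDATASOURCE ( 'SQLNCLI' , 'Data Source=remote_server;Integrated Security=SSPI'
                    ).AdventureWorks.HumanResources.Employee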
Additional doc excerpts
Provides ad hoc connection information as part of a four-part object name without using a linked server name.
...
OPENDATASOURCE can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.
The OPENDATASOURCE function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, OPENDATASOURCE can be used as the first part of a four-part name that refers to a table or view name in a SELECT, INSERT, UPDATE, or DELETE statement, or to a remote stored procedure in an EXECUTE statement. When executing remote stored procedures, OPENDATASOURCE should refer to another instance of SQL Server. OPENDATASOURCE does not accept variables for its arguments.
Like the OPENROWSET function, OPENDATASOURCE should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither OPENDATASOURCE nor OPENROWSET provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. All connection information, including passwords, must be provided every time that OPENDATASOURCE is called.
The Virtuoso Universal Server, by contrast, will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.
There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.
All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.
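For flavor, here is a hedged sketch of such linking via Virtuoso's SQL channel -- the graphical Conductor performs the same operation interactively. The DSN, credentials, and names are hypothetical, so treat the Virtuoso documentation as authoritative on exact syntax:

-- Link a remote Oracle table into the local schema via its ODBC DSN
ATTACH TABLE HR.EMPLOYEES AS ORA_HR.EMPLOYEES
  FROM 'OracleDSN' USER 'scott' PASSWORD 'tiger';

-- Thereafter the table behaves like any local Virtuoso table
SELECT COUNT (*) FROM ORA_HR.EMPLOYEES;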
02/12/2010 16:44 GMT | Modified: 02/17/2010 11:21 GMT
Compare & Contrast: Oracle Heterogeneous Services (HSODBC, DG4ODBC) vs Virtuoso's Virtual Database Layer
[ Virtuoso Data Space Bot ]
Oracle Gateway Promise
Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue SQL queries against any ODBC- or OLE-DB-accessible linked back end.
Reality
The promise fails to materialize for several reasons. Immediate limitations include:
- All tables locked by a FOR UPDATE clause and all tables with LONG columns selected by the query must be located in the same external database.
- Distributed queries cannot select user-defined types or object REF datatypes on remote tables.
In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations:
- A table including a BLOB column must have a separate column that serves as a primary key.
- BLOB and CLOB data cannot be read by pass-through queries.
- Updates or deletes that include unsupported functions within a WHERE clause are not allowed.
- Generic Connectivity does not support stored procedures.
- Generic Connectivity agents cannot participate in distributed transactions; they support single-site transactions only.
- Generic Connectivity does not support multithreaded agents.
- Updating LONG columns with bind variables is not supported.
- Generic Connectivity does not support ROWIDs.
Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute-force methods. Rather than interrogating the data access provider (whether ODBC or OLE DB) or the DBMS to which they are connected, to learn their capabilities, they do many things using the lowest possible function.
For instance, when a SELECT COUNT (*) FROM table@link is issued through Oracle SQL, the gateway doesn't simply push SELECT COUNT (*) FROM table to the target DBMS. Rather, it issues a SELECT * FROM table, which it uses to inventory all columns in the table, and then executes and fully retrieves SELECT field FROM table into an internal temporary table, where it performs the COUNT (*) itself, locally. Testing has confirmed this behavior, despite Oracle documentation stating that target data sources must support COUNT (*) (among other functions).
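In SQL terms, the observed behavior looks roughly like this (table and link names are hypothetical):

-- What the client issues through Oracle SQL:
SELECT COUNT (*) FROM sales@odbc_link;

-- What the gateway actually sends to the target DBMS, per the testing above:
SELECT * FROM sales;         -- used only to inventory the table's columns
SELECT order_id FROM sales;  -- fully retrieved into a temporary table,
                             -- where Oracle performs the COUNT (*) locally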
The Virtuoso Universal Server will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.
There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.
All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.
02/12/2010 16:43 GMT | Modified: 02/17/2010 11:21 GMT
New ADO.NET 3.x Provider for Virtuoso Released (Update 2)
[ Kingsley Uyi Idehen ]
I am pleased to announce the immediate availability of the Virtuoso ADO.NET 3.5 data provider for Microsoft's .NET platform.
What is it?
A data access driver/provider that provides conceptual, entity-oriented access to RDBMS data managed by Virtuoso. Naturally, it also uses Virtuoso's in-built virtual/federated database layer to provide access to ODBC- and JDBC-accessible RDBMS engines such as Oracle (7.x to latest), SQL Server (4.2 to latest), Sybase, IBM Informix (5.x to latest), IBM DB2, Ingres (6.x to latest), Progress (7.x to OpenEdge), MySQL, PostgreSQL, Firebird, and others, using our ODBC or JDBC bridge drivers.
Benefits?
Technical:
It delivers an Entity-Attribute-Value + Classes & Relationships model over disparate data sources that are materialized as .NET Entity Framework Objects, which are then consumable via ADO.NET Data Object Services, LINQ for Entities, and other ADO.NET data consumers.
The provider is fully integrated into Visual Studio 2008 and delivers the same "ease of use" offered by Microsoft's own SQL Server provider, but across Virtuoso, Oracle, Sybase, DB2, Informix, Ingres, Progress (OpenEdge), MySQL, PostgreSQL, Firebird, and others. The same benefits also apply uniformly to Entity Frameworks compatibility.
Bearing in mind that Virtuoso is a multi-model (hybrid) data manager, this also implies that you can use .NET Entity Frameworks against all data managed by Virtuoso. Remember, Virtuoso's SQL channel is a conduit to Virtuoso's core; thus, RDF (courtesy of SPASQL as already implemented re. Jena/Sesame/Redland providers), XML, and other data forms stored in Virtuoso also become accessible via .NET's Entity Frameworks.
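As a minimal sketch of what that conduit looks like in practice: Virtuoso's SQL channel accepts SPASQL, so a SPARQL statement can be issued verbatim over any connection (ODBC, JDBC, or this ADO.NET provider) and returns an ordinary result set:

SPARQL
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10 ;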
Strategic:
You can choose which entity oriented data access model works best for you: RDF Linked Data & SPARQL or .NET Entity Frameworks & Entity SQL. Either way, Virtuoso delivers a commercial grade, high-performance, secure, and scalable solution.
How do I use it?
Simply follow one of the guides below:
Note: When working with external or 3rd party databases, simply use the Virtuoso Conductor to link the external data source into Virtuoso. Once linked, the remote tables will simply be treated as though they are native Virtuoso tables, leaving the virtual database engine to handle the rest. This is similar to the role the Microsoft JET engine played in the early days of ODBC; so if you've ever linked an ODBC data source into Microsoft Access, you are ready to do the same using Virtuoso.
01/08/2009 04:36 GMT | Modified: 01/08/2009 09:12 GMT
The Trouble with Labels (Contd.): Data Integration & SOA
[ Kingsley Uyi Idehen ]
I just stumbled across a post from IT Business Edge titled: How Semantic Technology Can Help Companies with Integration. While reading the post I encountered the term Master Data Manager (MDM), and wondered to myself, "what's that?", only to realize it's the very same thing I described as Data Virtualization or Virtual Database technology (circa 1998). Now, if re-labeling can confuse me when applied to a realm I've been intimately involved with for eons (internet time), I don't want to imagine what it does for others who aren't as intimately involved with the important data access and data integration realms.
On the more refreshing side, the article does shed some light on the potency of RDF and OWL when applied to the construction of conceptual views of heterogeneous data sources.
"How do you know that data coming from one place calculates net revenue the same way that data coming from another place does? You’ve got people using the same term for different things and different terms for the same things. How do you reconcile all of that? That’s really what semantic integration is about."
BTW -- I discovered this article via another titled: Understanding Integration And How It Can Help with SOA, which covers SOA and Integration matters. Again, in this piece I sense the gradual realization of the virtues that RDF, OWL, and RDF Linked Data bring to bear in the vital realm of data integration across heterogeneous data silos.
Conclusion
A number of events, at the micro and macro economic levels, are forcing attention back to the issue of productive use of existing IT resources. The trouble with this quest is that it ultimately unveils the global IT affliction known as heterogeneous data silos, and the challenge of alleviating the pain they cause -- a challenge that has long been ignored, or approached inadequately, as clearly shown by the rapid build-up of SOA horror stories in the data integration realm.
Data Integration via conceptualization of heterogeneous data sources, resulting in concrete conceptual-level data access and management, remains the greatest and most potent application of technologies associated with the "Semantic Web" and/or "Linked Data" monikers.
10/12/2008 18:53 GMT | Modified: 10/12/2008 18:54 GMT
Crunchbase & Semantic Web Interview (Remix - Update 1)
[ Kingsley Uyi Idehen ]
After reading Bengee's interview with CrunchBase, I decided to knock up a quick interview remix as part of my usual attempt to add to the developing discourse.
CrunchBase: When we released the CrunchBase API, you were one of the first developers to step up and quickly released a CrunchBase Sponger Cartridge. Can you explain what a CrunchBase Sponger Cartridge is?
Me: A Sponger Cartridge is a data access driver for Web Resources that plugs into our Virtuoso Universal Server (DBMS and Linked Data Web Server combo amongst other things). It uses the internal structure of a resource and/or a web service associated with a resource, to materialize an RDF based Linked Data graph that essentially describes the resource via its properties (Attributes & Relationships).
CrunchBase: And what inspired you to create it?
Me: Bengee built a new space with your data, and we've built a space on the fly from your data which still resides in your domain. Either solution extols the virtues of Linked Data, i.e., the ability to explore relationships across data items with high degrees of serendipity (also known colloquially in Semantic Web circles as the "follow-your-nose" pattern).
Bengee posted a notice to the Linking Open Data Community's public mailing list announcing his effort. Bearing in mind that we've been using middleware to mesh the realms of Web 2.0 and the Linked Data Web for a while, it was a no-brainer to knock something up based on the conceptual similarities between Wikicompany and CrunchBase. In a sense, a quadrant of orthogonality is what immediately came to mind re. Wikicompany, CrunchBase, Bengee's RDFization efforts, and ours.
Bengee created an RDF based Linked Data warehouse from the data exposed by your API, which is exposed via the Semantic CrunchBase data space. In our case we've taken the "RDFization on the fly" approach, which produces a transient Linked Data View of the CrunchBase data exposed by your APIs. Our approach is in line with our world view: all resources on the Web are data sources, and the Linked Data Web is about incorporating HTTP into the naming scheme of these data sources, so that the conventional URL-based hyperlinking mechanism can be used to access a structured description of a resource, which is then transmitted using a range of negotiable representation formats. In addition, because we house and publish a lot of Linked Data on the Web (e.g., DBpedia, PingTheSemanticWeb, and others), we've also automatically meshed CrunchBase data with related data in DBpedia and Wikicompany.
CrunchBase: Do you know of any apps that are using CrunchBase Cartridge to enhance their functionality?
Me: Yes, the OpenLink Data Explorer which provides CrunchBase site visitors with the option to explore the Linked Data in the CrunchBase data space. It also allows them to "Mesh" (rather than "Mash") CrunchBase data with other Linked Data sources on the Web without writing a single line of code.
CrunchBase: You have been immersed in the Semantic Web movement for a while now. How did you first get interested in the Semantic Web?
Me: We saw the Semantic Web as a vehicle for standardizing conceptual views of heterogeneous data sources via context lenses (URIs). In 1998, as part of our strategy to expand our business beyond the development and deployment of ODBC, JDBC, and OLE-DB data providers, we decided to build a Virtual Database Engine (see: Virtuoso History), and in doing so we sought a standards-based mechanism for the conceptual output of the data virtualization effort. As of the time of the seminal unveiling of the Semantic Web in 1998, we were clear about two things in relation to the effects of the Web and Internet data management infrastructure inflections: 1) existing DBMS technology had reached its limits; 2) Web Servers would ultimately hit their functional limits. These fundamental realities compelled us to develop Virtuoso with an eye to leveraging the Semantic Web as a vehicle for completing its technical roadmap.
CrunchBase: Can you put into layman’s terms exactly what RDF and SPARQL are and why they are important? Do they only matter for developers or will they extend past developers at some point and be used by website visitors as well?
Me: RDF (Resource Description Framework) is a Graph based Data Model that facilitates resource description using the Subject, Predicate, and Object principle. Associated with the core data model, as part of the overall framework, are a number of markup languages for expressing your descriptions (just as you express presentation markup semantics in HTML or document structure semantics in XML) that include: RDFa (simple extension of HTML markup for embedding descriptions of things in a page), N3 (a human friendly markup for describing resources), RDF/XML (a machine friendly markup for describing resources).
SPARQL is the query language associated with the RDF Data Model, just as SQL is the query language associated with the Relational Database Model. Thus, when you have RDF based structured and linked data on the Web, you can query against the Web using SPARQL, just as you would against an Oracle/SQL Server/DB2/Informix/Ingres/MySQL/etc. DBMS using SQL. That's it in a nutshell.
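To make the analogy concrete, here is a hedged pairing (table, column, and vocabulary choices are purely illustrative): the same kind of question asked of a relational table in SQL, and of RDF data in SPARQL:

-- SQL against a relational table:
SELECT name FROM person WHERE country = 'Nigeria';

-- SPARQL against RDF data (here using the FOAF vocabulary):
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE { ?person a foaf:Person ; foaf:name ?name }
LIMIT 10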
CrunchBase: On your website you wrote that “RDF and SPARQL as productivity boosters in everyday web development”. Can you elaborate on why you believe that to be true?
Me: I think the ability to discern a formal description of anything via its discrete properties is of immense value re. productivity, especially when the capability in question results in a graph of Linked Data that isn't confined to a specific host operating system, database engine, application or service, programming language, or development framework. RDF Linked Data is about infrastructure for the true materialization of the "Information at Your Fingertips" vision of yore. Even though it's taken the emergence of RDF Linked Data to make the aforementioned vision tractable, the comprehension of the vision's intrinsic value has been clear for a very long time. Most organizations and/or individuals are quite familiar with the adage "Knowledge is Power"; well, there isn't any knowledge without accessible Information, and there isn't any accessible Information without accessible Data. The Web has always been grounded in accessibility to data (albeit via compound container documents called Web Pages). Bottom line: RDF based Linked Data is about Open Data access by reference using URIs (HTTP based Entity IDs / Data Object IDs / Data Source Names), and as I said earlier, the intrinsic value is pretty obvious, bearing in mind the costs associated with integrating disparate and heterogeneous data sources -- across intranets, extranets, and the Internet.
CrunchBase: In his definition of Web 3.0, Nova Spivack proposes that the Semantic Web, or Semantic Web technologies, will be the force behind much of the innovation that will occur during Web 3.0. Do you agree with Nova Spivack? What role, if any, do you feel the Semantic Web will play in Web 3.0?
Me: I agree with Nova. But I see Web 3.0 as a phase within the Semantic Web innovation continuum. Web 3.0 exists because Web 2.0 exists. Both of these Web versions express usage and technology focus patterns. Web 2.0 is about the use of Open Source technologies to fashion Web Services that are ultimately used to drive proprietary Software as Service (SaaS) style solutions. Web 3.0 is about the use of "Smart Data Access" to fashion a new generation of Linked Data aware Web Services and solutions that exploit the federated nature of the Web to maximum effect; proprietary branding will simply be conveyed via quality of data (cleanliness, context fidelity, and comprehension of privacy) exposed by URIs.
Here are some examples of the CrunchBase Linked Data Space, as projected via our CrunchBase Sponger Cartridge:
- Amazon.com
- Microsoft
- Google
- Apple
08/27/2008 18:16 GMT | Modified: 08/27/2008 20:35 GMT
My Talis Podcast re. Semantic Web, Linked Data, and OpenLink Software
[ Kingsley Uyi Idehen ]
My podcast interview with Paul Miller of Talis is out. As I listened to the podcast (a naturally awkward affair) I got a first-hand sense of Paul's mastery of the art of interviewing, even when dealing with a fast-talking data blitzer like me. Personally, I think I still talk a little too fast (the Nigerian in me), especially when the subject matter homes right in on the epicenter of my professional passions: Open Data Access and Heterogeneous Data Integration (aka Virtual Database Technology) -- so you may need to rewind every now and then during the interview :-)
During this particular podcast interview, I deliberately wanted to have a conversation about the practical value of Linked Data, rather than the technical innards. The fundamental utility of Linked Data remains somewhat mercurial, and I am certainly hoping to do my bit at the upcoming Linked Data Planet conference re. demonstrating and articulating linked data value across the blurring realms of "the individual" and "the enterprise".
Note to my old schoolmates on Facebook: when you listen to this podcast you will at least reconcile "Uyi Idehen" with "Kingsley Idehen". Unfortunately, Facebook refuses to let me Identify myself in the manner I choose. Ideally, I would like to have the name: "Kingsley (Uyi) Idehen" associated with my Facebook ID since this is the Identifier known to my personal network of friends, family, and old schoolmates. This Identity predicament is a long running Identity case study in the making.
05/16/2008 00:10 GMT | Modified: 05/16/2008 12:53 GMT
Adding Wordpress Blogs into the Linked Data Web using Virtuoso
[ Kingsley Uyi Idehen ]
Wordpress is a Weblog platform comprised of the following:
- User Interface - PHP
- Application Logic - PHP
- Data Storage (SQL RDBMS) - MySQL via PHP-MySQL
- Application Server - Apache
In the form above (the norm), Wordpress data can be injected into the Linked Data Web via RDFization middleware such as the Virtuoso Sponger (built into all Virtuoso instances) and Triplr. The downside of this approach is that the blog owner doesn't necessarily possess full control over their contributions to the emerging Giant Global Graph of Linked Data.
Another route to Linked Data exposure is via Virtuoso's Metaschema Language for producing RDF Views over ODBC/JDBC-accessible Data Sources, which enables the following setup:
- User Interface - PHP
- Application Logic - PHP
- Data Storage (SQL RDBMS) - MySQL via the PHP-MySQL data access interface
- Virtual Database linkage of MySQL Tables into Virtuoso
- RDF View generated over the Virtual SQL Tables
- Application Server - Virtuoso, which provides Linked Data Deployment such that RDF Linked Data is exposed when requested by Web User Agents
Alternatively, you can also exploit Virtuoso as the SQL DBMS, RDF DBMS, Application Server, and Linked Data Deployment platform:
- User Interface - PHP
- Application Logic - PHP
- Data Storage (SQL RDBMS) - Virtuoso via the PHP-ODBC data access interface (ODBC is Virtuoso's native SQL CLI/API)
- RDF View generated over the Native SQL Tables
- Application Server - Virtuoso, which provides Linked Data Deployment such that RDF Linked Data is exposed when requested by Web User Agents (e.g., OpenLink RDF Browser, Zitgist Data Viewer, DISCO Hyperdata Browser, and Tabulator)
Benefits?
- Each user account gets a proper Linked Data URI (ID) that can be meshed/smushed with other IDs (so you can add data from this new blog space to other linked data sources associated with your other URIs/IDs)
- Each post gets a proper URI
- All data is now query-able via SPARQL
- Discoverability increases exponentially (without a drop in relevance in either direction, i.e., discovering or being discovered)
How Do I map the WordPress SQL Schema to RDF using Virtuoso?
- Determine the RDF Schema or Ontologies that define the Classes for which you will be producing instance data (e.g., SIOC and FOAF)
- Declare URI/IRI generator functions (special Virtuoso functions)
- Use SPARQL Graph patterns to apply the URI/IRI generator functions to Tables, Views, Table-Valued Stored Procedures, and Query Resultsets as part of the RDBMS-to-RDF mapping
Read the Meta Schema Language guide or simply apply our "WordPress SQL Schema to RDF" script to your Virtuoso hosted instance.
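For flavor only, here is a rough, heavily hedged sketch of what the second and third steps can look like; the syntax is approximate and every name below (prefixes, IRI template, table and map names) is hypothetical, so treat the Meta Schema Language guide as authoritative:

SPARQL
prefix wp: <http://example.com/schemas/wordpress#>
create iri class wp:post_iri "http://example.com/wordpress/posts/%d"
  ( in post_id integer not null ) . ;

SPARQL
prefix wp: <http://example.com/schemas/wordpress#>
prefix sioc: <http://rdfs.org/sioc/ns#>
alter quad storage virtrdf:DefaultQuadStorage
  from "WP"."blog"."wp_posts" as posts
{
  create virtrdf:WordPressMap as graph iri ( "http://example.com/wordpress#" )
  {
    wp:post_iri ( posts.ID ) a sioc:Post ;
      sioc:content posts.post_content .
  }
} . ;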
Of course, there are other mappings that cover other PHP applications deployed via Virtuoso:
Live Demos?
04/09/2008 21:27 GMT | Modified: 04/10/2008 12:33 GMT
Linked Data is vital to Enterprise Integration driven Agility
[ Kingsley Uyi Idehen ]
John Schmidt, from Informatica, penned an interesting post titled: IT Doesn't Matter - Integration Does. Yes, integration is hard, but I do profoundly believe that what's been happening on the Web over the last 10 or so years also applies to the Enterprise -- and by this I absolutely do not mean "Enterprise 2.0", since "2.0" and productive agility do not compute in my realm of discourse. Large collections of RSS feeds, Wikiwords, Shared Bookmarks, Discussion Forums, etc., when disconnected at the data level (i.e., hosted in pages with no access to the "data behind"), simply offer information deluge and inertia (there are only so many hours for processing opaque information sources in a given day). Enterprises fundamentally need to process information efficiently as part of a perpetual assessment of their relative competitive Strengths, Weaknesses, Opportunities, and Threats (SWOT) in existing and/or future markets. Historically, IT acquisitions have run counter to the aforementioned quest for agility, due to the predominance of the "rip and replace" approach to technology acquisition, which repeatedly creates and perpetuates information silos across Application, Database, Operating System, and Development Environment boundaries. The sequence of events typically occurs as follows:
- applications are acquired on a problem-by-problem basis
- back-end application databases are discovered once ad-hoc information views are sought by information workers
- back-end database disparity across applications is discovered once holistic views are sought by knowledge workers (typically domain experts).
In the early-to-mid '90s (pre the ubiquitous Web), operating system, programming language, and development framework independence inside the enterprise was technically achievable via ODBC (due to its platform independence). That said, DBMS-specific ODBC channels alone couldn't address the holistic requirements associated with Conceptual Views of disparate data sources, hence the need for Data Access Virtualization via Virtual Database Engine technology. Just as is the case on the Web today, with the emergence of the "Linked Data" meme, enterprises now have a powerful mechanism for exploiting the Data Integration benefits associated with generating Data Objects, endowed with HTTP-based IDs (URIs), from disparate data sources. Conceptualizing access to data exposed via Database APIs, SOA-based Web Services (SOAP-style Web Services), Web 2.0 APIs (REST-style Web Services), XML Views of SQL Data (SQLX), pure XML, etc., is the problem area addressed by RDF-aware middleware (RDFizers, e.g., the Virtuoso Sponger). Here are examples of what SQL Rows exposed as RDF Data Objects (identified using HTTP-based URIs) would look like outside or behind a corporate firewall:
What's Good for the Web Goose (Personal Data Space URIs) is good for the Enterprise Gander (Enterprise Data Space URIs).
03/21/2008 21:56 GMT | Modified: 03/22/2008 14:13 GMT
Virtuoso Cluster
[ Orri Erling ]
We often get questions on clustering support, especially around RDF, where databases quickly get rather large. So we will answer them here.
But first, some supporting technology. We have an entirely new disk allocation and IO system. It is basically operational but needs some further tuning. It offers much better locality and much better sequential access speeds.
Especially for dealing with large RDF databases, we will introduce data compression. We have over the years looked at different key compression possibilities, but have never been very excited by them, since they complicate random access to index pages, make for longer execution paths, require scraping data for one logical thing from many places, and so on. Anyway, now we will compress pages before writing them to disk, so the cache is in machine byte order and alignment while the disk is compressed. Since multiple processors are commonplace on servers, they can well be used for compression, that being such a nicely local operation, all in cache and requiring no serialization with other things.
Of course, what was fixed length now becomes variable length, but if the compression ratio is fairly constant, we reserve space for the expected compressed size and deal with the rare overflows separately. So there is no complicated shifting of data around when something grows.
Once we are done with this, it could well become a separate intermediate release.
Now about clusters. We have for a long time had various plans for clusters but have not seen the immediate need for execution. With the rapid growth in the Linking Open Data movement and questions on web scale knowledge systems, it is time to get going.
How will it work? Virtuoso remains a generic DBMS, thus the clustering support is an across the board feature, not something for RDF only. So we can join Oracle, IBM DB2, and others at the multi-terabyte TPC races.
We introduce hash partitioning at the index level and allow for redundancy, where multiple nodes can serve the same partition, allowing for load-balancing of reads, replacement of failing nodes, and growth of the cluster without interruption of service.
The SQL compiler, SPARQL, and database engine all stay the same. There is a little change in the SQL run time, not so different from what we do with remote databases at present in the context of our virtual database federation. There is a little extra complexity for distributed deadlock detection, and sometimes multiple threads per transaction. We remember that one RPC round trip is worth 3-4 index lookups, so we pipeline things so as to move requests in batches, a few dozen at a time.
The cluster support will be in the same executable and will be enabled by configuration file settings. Administration is limited to one node, but Web and SQL clients can connect to any node and see the same data. There is no balancing between storage and control nodes because clients can simply be allocated round robin for statistically even usage. In relational applications, as exemplified by TPC-C, if one partitions by fields with an application meaning (such as warehouse ID), and if clients have an affinity to a particular chunk of data, they will of course preferentially connect to nodes hosting this data. With RDF, such affinity is unlikely, so nodes are basically interchangeable.
In practice, we develop in June and July. Then we can rent a supercomputer, maybe from Amazon EC2, and experiment away.
We should just come up with a name for this. Maybe something astronomical, like "star cluster". Big, bright, but in this case not far away.
05/23/2007 14:11 GMT | Modified: 04/24/2008 09:52 GMT