Details

OpenLink Software
Burlington, United States

Subscribe

Post Categories

Recent Articles

Community Member Blogs

Display Settings

articles per page.
order.

Translate

Showing posts in all categories RefreshRefresh
Why Do I Need To Pay For ODBC , JDBC, ADO.NET, OLE-DB Drivers? [ UDA Data Space Bot ]

Payment is a function of pain alleviation (opportunity cost) monetization.

This post is about highlighting the real pains associated with the $0.00 misconception associated with Data Access Drivers: ODBC, JDBC, ADO.NET, OLE-DB etc.

In the most basic sense, there are some fundament aspects of data access that are complex to implement and rarely implemented (if at all) by free drivers, the list includes:

  1. Escape Syntaxes for Dates and Functions
  2. Metadata Calls which enable smarter ODBC compliant applications (this feature is typically missing on Driver Side and abused on the Client side i.e., making clients DBMS specific by testing for specific DBMS names)
  3. Scrollable Cursors, this is how you deal with change sensitivity, and most drivers actually fake support and get away with it due to shortage of applications to test proper cursor types (Static, Forward-Only, Key-Set, Dynamic, and Mixed models).

Okay, so we're done with actual driver sophistication re. implementation of critical features. Let's Up the ante by veering into the area of security. At the most basic level, It's extremely important to understand that all data access driver types provide read-write access to your databases; thus, it's imperative that data access drivers address the following:

  1. Read-Only or Read-Write Access scoped to specific Users
  2. Ditto applied to specific User Groups
  3. Ditto applied to Database Names
  4. Ditto applied to specific ODBC compliant applications
  5. Ditto applied to specific ODBC host operating systems
  6. Ditto applied to specific IP addresses or Ranges on your Network
  7. Any combination of items 1-6 as part of a configurable data access rules/policy system.

Once you're done with security, you then have the thorny issue of data access and data flow management. In a nutshell, your driver needs to be able to handle:

  1. Protection against cartesian product network flooding (e.g., user clicks on Customer Table via an ODBC compliant application without comprehension of back-end implications)
  2. Enabling or Disabling of key DBMS engine data access optimization features (e.g. DBMS specific extensions exposed via Environment Variables of SQL commands based settings)
  3. Conditional Connection Pooling across User, User Groups, Applications, Host Operating System, IP Address dimensions.

Once you've dealt with Security and Data Flow, you then have to address the enforcement of these settings across a myriad of ODBC compliant host, which is where Zeroconfig and centralized data access administration comes into play i.e., configure once (locally) and enforce globally.

When OpenLink Software entered the ODBC Driver Market segment in 1992, the issues above where the fundamental basis of our Multi-Tier Drivers. Thus, although we distinguished ourselves via performance, stability, and specification adherence, our fundamental engineering focus has always been skewed towards security and configurability, alongside high-performance and scalability.

As we close 2009, the security issues that pervade Native DBMS Drives, ODBC, JDBC, ADO.NET, OLE-DB etc. Drivers have only increased, courtesy of ubiquitous computing, sadly though, there remains a fundamental illusion that Data Access Drivers simply connect you to DBMS back-ends, and since you can get these drivers at $0.00 from most DBMS vendors they can't be that important.

I hope that this post brings some clarity to a very serious security and general configuration management issues associated with Data Access Drivers. Free ODBC Drivers offer nothing, when it comes to the real issues of Open Data Access. If they did, they wouldn't be worth $0.00!

Note: wondering if this has anything to do with Linked Data (my current data access focal point)? Well, remember, the Linked Data meme is fundamentally about REST based Open Data Access & Integration via HTTP; thus, what applies to Relational Model databases naturally applies to their more granular Graph Model relatives. Basically, data access security never goes away, it just gets more granular, complex, and ultimately, mercurial.

Related

# PermaLink Comments [0]
02/05/2010 01:03 GMT Modified: 02/05/2010 01:04 GMT
Compare & Contrast: SQL Server's Linked Server vs Virtuoso's Virtual Database Layer [ Virtuso Data Space Bot ]

Microsoft SQL Server's Linked Server Promise

The ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end -- via Linked Servers.

The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them work with all data access providers, and even for those that do, results received via different methods may differ.

Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as "ad-hoc distributed queries".

Common tools that are typically used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them.

The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end's abilities, and using the Transact-SQL dialect for implementation-specific query syntaxes, regardless of the back-end's dialect. This leads to problems especially when the Linked Server is an older variant which doesn't support SQL-92 (e.g., Progress 8.x or earlier, Informix 7 or earlier), or which SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, MySQL, etc.).

Basic Four-Part Naming

SELECT *FROMlinked_server.[catalog].[schema].object

Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides what if any sub- or partial-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.

OpenQuery

SELECT * FROMOPENQUERY ( linked_server , 'query' )

OpenQuery? also presumes that you have pre-defined a Linked Server, but executes the query as a "pass-through", handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them.

From the product docs:

SQL Server's Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. OPENQUERY can be referenced in the FROM clause of a query as if it were a table name. OPENQUERY can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, OPENQUERY returns only the first one....OPENQUERY does not accept variables for its arguments.OPENQUERY cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name.OpenRowset

SELECT * FROMOPENROWSET( 'provider_name' , 'datasource' ; 'user_id' ; 'password', { [ catalog. ] [ schema. ] object | 'query' })

OpenRowset? does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both "pass-through" and "local execution" queries, which can lead to confusion when the results differ (as they regularly will).

More from product docs:

Includes all connection information that is required to access remote data from an OLE DB data source. This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The OPENROWSET function can be referenced in the FROM clause of a query as if it were a table name. The OPENROWSET function can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, OPENROWSET returns only the first one.

OPENROWSET also supports bulk operations through a built-in BULK provider that enables data from a file to be read and returned as a rowset.

...OPENROWSET can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess? registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation.

Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form schema.object must be specified. If the provider supports only catalog names, a three-part name of the form catalog.schema.object must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. For more information, see Transact-SQL Syntax Conventions (Transact-SQL).OPENROWSET does not accept variables for its arguments.OpenDataSource

SELECT * FROMOPENDATASOURCE( 'provider_name',   'provider_specific_datasource_specification').[catalog].[schema].object

As with basic four-part naming, OpenDataSource? executes the query on SQL Server. SQL Server decides what if any sub-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.

Additional doc excerpts

Provides ad hoc connection information as part of a four-part object name without using a linked server name.

...OPENDATASOURCE can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess? registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.

The OPENDATASOURCE function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, OPENDATASOURCE can be used as the first part of a four-part name that refers to a table or view name in a SELECT, INSERT, UPDATE, or DELETE statement, or to a remote stored procedure in an EXECUTE statement. When executing remote stored procedures, OPENDATASOURCE should refer to another instance of SQL Server. OPENDATASOURCE does not accept variables for its arguments.

Like the OPENROWSET function, OPENDATASOURCE should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither OPENDATASOURCE nor OPENROWSET provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. All connection information, including passwords, must be provided every time that OPENDATASOURCE is called.

Virtuoso's Virtual Database Promise & Deliverables

The ability to link objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.

There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.

All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.

# PermaLink Comments [0]
01/31/2010 18:04 GMT
Compare & Contrast: Oracle Heterogeneous Services (HSODBC, DG4ODBC) vs Virtuoso's Virtual Database Layer [ Virtuso Data Space Bot ]

Oracle Gateway Promise

Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue SQL queries against any ODBC- or OLE-DB-accessible linked back end.

Reality

Promise fails to materialize for several reasons. Immediate limitations include:

All tables locked by a FOR UPDATE clause and all tables with LONG columns selected by the query must be located in the same external database.

Distributed queries cannot select user-defined types or object REF datatypes on remote tables.In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations:

  • A table including a BLOB column must have a separate column that serves as a primary key.
  • BLOB and CLOB data cannot be read by passthrough queries.
  • Updates or deletes that include unsupported functions within a WHERE clause are not allowed.
  • Generic Connectivity does not support stored procedures.
  • Generic Connectivity agents cannot participate in distributed transactions; they support single-site transactions only.
  • Generic Connectivity does not support multithreaded agents.
  • Updating LONG columns with bind variables is not supported.
  • Generic Connectivity does not support rowids.

Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute-force methods. Rather than interrogating the data access provider (whether ODBC or OLE DB) or DBMS to which they are connected, to learn their capabilities, many things are done by using the lowest possible function.

For instance, when a SELECT COUNT (*) FROM table@link is issued through Oracle SQL, the target DBMS doesn't simply performs a SELECT COUNT (*) FROM table -- rather, it performs a SELECT * FROM table which is used to inventory all columns in the table, and then performs and fully retrieves SELECT field FROM table into an internal temporary table, where it does the COUNT(*) itself, locally. Testing has confirmed this process to be the case despite Oracle documentation stating that target data sources must support COUNT (*) (among other functions).

Virtuoso's Virtual Database Comparison

The Virtuoso Universal Server will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.

There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.

All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.

# PermaLink Comments [0]
01/31/2010 18:03 GMT
The Business Of Linked Data (BOLD) Discussion Space [ Kingsley Uyi Idehen ]

I've created a new discussion space that's squarely focused on the business development and marketing aspects of "HTTP based Linked Data" (Linked Data). As its name indicates, It's a BOLD attempt to fill a VoiD. :-)

Background

A few months ago, Aldo Bucchi posted a message to the LOD mailing list seeking a discussion space for more business and marketing oriented topic, in relation to Linked Data. At the time, my assumption was that the existing LOD mailing list served that purpose absolutely fine, but in due course I came to realize that Aldo's request had a much lager foundation than I initially suspected.

Historic Oversight

Linked Data, like its umbrella Semantic Web Project, has suffered from an inadvertent oversight on the parts of many of its enthusiasts (myself included): 100% of the discussion spaces are created by, geared towards, or dominated by researchers (from Academia primarily) and/or developers. Thus, at the very least, we've been operating in an echo chamber that only feed the existing void between the core community and those who are more interested in discussing business and marketing related topics.

The new discussion space seeks to cover the following:

  1. Brainstorming Value Proposition Articulation
  2. War Story Exchanges
  3. Case Studies and Use-cases
  4. Market Research & Positioning (for instance Linked Data is killer technology that redefines Data Integration, but none of the major research firms currently make that connection)
  5. .

How Do I Join The Conversation? Simply sign up on the Google hosted BOLD mailing list, introduce yourself (ideally), and then start conversing! :-)

# PermaLink Comments [0]
01/31/2010 17:48 GMT Modified: 01/31/2010 17:48 GMT
Getting The Linked Data Value Pyramid Layers Right (Update #2) [ Kingsley Uyi Idehen ]

One of the real problems that pervades all routes to Linked Data value prop. incomprehension stems from the layering of its value pyramid; especially when communicating with -initially detached- end-users.

Note to Web Programmers: Linked Data is about Data (Wine) and not about Code (Fish). Thus, it isn't a "programmer only zone", far from it. More than anything else, its inherently inclusive and spreads its participation net widely across: Data Architects, Data Integrators, Power Users, Knowledge Workers, Information Workers, Data Analysts, etc.. Basically, everyone that can "click on a link" is invited to this particular party; remember, it is about "Linked Data" not "Linked Code", after all. :-)

Problematic Value Pyramid Layering

Here is an example of a Linked Data value pyramid that I am stumbling across --with some frequency-- these days (note: 1 being the pyramid apex):

  1. SPARQL Queries
  2. RDF Data Stores
  3. RDF Data Sets
  4. HTTP scheme URIs

Basically, Linked Data deployment (assigning de-referencable HTTP URIs to DBMS records, their attributes, and attribute values [optionally] ) is occurring last. Even worse, this happens in the context of Linked Open Data oriented endeavors, resulting in nothing but confusion or inadvertent perpetuation of the overarching pragmatically challenged "Semantic Web" stereotype.

As you can imagine, hitting SPARQL as your introduction to Linked Data is akin to hitting SQL as your introduction to Relational Database Technology, neither is an elevator-style value prop. relay mechanism.

In the relational realm, killer demos always started with desktop productivity tools (spreadsheets, report-writers, SQL QBE tools etc.) accessing, relational data sources en route to unveiling the "Productivity" and "Agility" value prop. that such binding delivered i.e., the desktop application (clients) and the databases (servers) are distinct, but operating in a mutually beneficial manner to all, courtesy of a data access standards such as ODBC (Open Database Connectivity).

In the Linked Data realm, learning to embrace and extend best practices from the relational dbms realm remains a challenge, a lot of this has to do with hangovers from a misguided perception that RDF databases will somehow completely replace RDBMS engines, rather than compliment them. Thus, you have a counter productive variant of NIH (Not Invented Here) in play, taking us to the dreaded realm of: Break the Pot and You Own It (exemplified by the 11+ year Semantic Web Project comprehension and appreciation odyssey).

From my vantage point, here is how I believe the Linked Data value pyramid should be layered, especially when communicating the essential value prop.:

  1. HTTP URLs -- LINKs to documents (Reports) that users already appreciate, across the public Web and/or Intranets
  2. HTTP URIs -- typically not visually distinguishable from the URLs, so use the Data exposed by de-referencing a URL to show how each Data Item (Entity or Object) is uniquely identified by a Generic HTTP URI, and how clicking on the said URIs leads to more structured metadata bearing documents available in a variety of data representation formats, thereby enabling flexible data presentation (e.g., smarter HTML pages)
  3. SPARQL -- when a user appreciates the data representation and presentation dexterity of a Generic HTTP URI, they will be more inclined to drill down an additional layer to unravel how HTTP URIs mechanically deliver such flexibility
  4. RDF Data Stores -- at this stage the user is now interested data sources behind the Generic HTTP URIs, courtesy of natural desire to tweak the data presented in the report; thus, you now have an engaged user ready to absorb the "How Generic HTTP URIs Pull This Off" message
  5. RDF Data Sets -- while attempting to make or tweak HTTP URIs, users become curious about the actual data loaded into the RDF Data Store, which is where data sets used to create powerful Lookup Data Spaces (e.g., DBpedia) come into play such as those from the LOD constellation as exemplified by DBpedia (extractions from Wikipedia).

Related

# PermaLink Comments [0]
01/31/2010 17:46 GMT Modified: 01/31/2010 17:47 GMT
What is the DBpedia Project? (Updated) [ Kingsley Uyi Idehen ]

The recent Wikipedia imbroglio centered around DBpedia is the fundamental driver for this particular blog post. At time of writing this blog post, the DBpedia project definition in Wikipedia remains unsatisfactory due to the following shortcomings:

  1. inaccurate and incomplete definition of the Project's What, Why, Who, Where, When, and How
  2. inaccurate reflection of project essence, by skewing focus towards data extraction and data set dump production, which is at best a quarter of the project.

Here are some insights on DBpedia, from the perspective of someone intimately involved with the other three-quarters of the project.

What is DBpedia?

A live Web accessible RDF model database (Quad Store) derived from Wikipedia content snapshots, taken periodically. The RDF database underlies a Linked Data Space comprised of: HTML (and most recently HTML+RDFa) based data browser pages and a SPARQL endpoint.

Note: DBpedia 3.4 now exists in snapshot (warehouse) and Live Editions (currently being hot-staged). This post is about the snapshot (warehouse) edition, I'll drop a different post about the DBpedia Live Edition where a new Delta-Engine covers both extraction and database record replacement, in realtime.

When was it Created?

As an idea under the moniker "DBpedia" it was conceptualized in late 2006 by researchers at University of Leipzig (lead by Soren Auer) and Freie University, Berlin (lead by Chris Bizer). The first public instance of DBpedia (as described above) was released in February 2007. The official DBpedia coming out party occurred at WWW2007, Banff, during the inaugural Linked Data gathering, where it showcased the virtues and immense potential of TimBL's Linked Data meme.

Who's Behind It?

OpenLink Software (developers of OpenLink Virtuoso and providers of Web Hosting infrastructure), University of Leipzig, and Freie Univerity, Berlin. In addition, there is a burgeoning community of collaborators and contributors responsible DBpedia based applications, cross-linked data sets, ontologies (OpenCyc, SUMO, UMBEL, and YAGO) and other utilities. Finally, DBpedia wouldn't be possible without the global content contribution and curation efforts of Wikipedians, a point typically overlooked (albeit inadvertently).

How is it Constructed?

The steps are as follows:

  1. RDF data set dump preparation via Wikipedia content extraction and transformation to RDF model data, using the N3 data representation format - Java and PHP extraction code produced and maintained by the teams at Leipzig and Berlin
  2. Deployment of Linked Data that enables Data browsing and exploration using any HTTP aware user agent (e.g. basic Web Browsers) - handled by OpenLink Virtuoso (handled by Berlin via the Pubby Linked Data Server during the early months of the DBpedia project)
  3. SPARQL compliant Quad Store, enabling direct access to database records via SPARQL (Query language, REST or SOAP Web Service, plus a variety of query results serialization formats) - OpenLink Virtuoso since first public release of DBpedia

In a nutshell, there are four distinct and vital components to DBpedia. Thus, DBpedia doesn't exist if all the project offered was a collection of RDF data dumps. Likewise, it doesn't exist if you have a SPARQL compliant Quad Store without loaded data sets, and of course it doesn't exist if you have a fully loaded SPARQL compliant Quad Store is up to the cocktail of challenges presented by live Web accessibility.

Why is it Important?

It remains a live exemplar for any individual or organization seeking to publishing or exploit HTTP based Linked Data on the World Wide Web. Its existence continues to stimulate growth in both density and quality of the burgeoning Web of Linked Data.

How Do I Use it?

In the most basic sense, simply browse the HTML pages en route to discovery erstwhile relationships that exist across named entities and subject matter concepts / headings. Beyond that, simply look at DBpedia as a master lookup table in a Web hosted distributed database setup; enabling you to mesh your local domain specific details with DBpedia records via structured relations (triples or 3-tuples records) comprised of HTTP URIs from both realms e.g., owl:sameAs relations.

What Can I Use it For?

Expanding on the Master-Details point above, you can use its rich URI corpus to alleviate tedium associated with activities such as:

  1. List maintenance - e.g., Countries, States, Companies, Units of Measurement, Subject Headings etc.
  2. Tagging - as a compliment to existing practices
  3. Analytical Research - you're only a LINK (URI) away from erstwhile difficult to attain research data spread across a broad range of topics
  4. Closed Vocabulary Construction - rather than commence the futile quest of building your own closed vocabulary, simply leverage Wikipedia's human curated vocabulary as our common base.

Related

# PermaLink Comments [0]
01/31/2010 17:45 GMT Modified: 01/31/2010 17:46 GMT
Getting The Linked Data Value Pyramid Layers Right (Update #2) [ Kingsley Uyi Idehen ]

One of the real problems that pervades all routes to Linked Data value prop. incomprehension stems from the layering of its value pyramid; especially when communicating with -initially detached- end-users.

Note to Web Programmers: Linked Data is about Data (Wine) and not about Code (Fish). Thus, it isn't a "programmer only zone", far from it. More than anything else, its inherently inclusive and spreads its participation net widely across: Data Architects, Data Integrators, Power Users, Knowledge Workers, Information Workers, Data Analysts, etc.. Basically, everyone that can "click on a link" is invited to this particular party; remember, it is about "Linked Data" not "Linked Code", after all. :-)

Problematic Value Pyramid Layering

Here is an example of a Linked Data value pyramid that I am stumbling across --with some frequency-- these days (note: 1 being the pyramid apex):

  1. SPARQL Queries
  2. RDF Data Stores
  3. RDF Data Sets
  4. HTTP scheme URIs

Basically, Linked Data deployment (assigning de-referencable HTTP URIs to DBMS records, their attributes, and attribute values [optionally] ) is occurring last. Even worse, this happens in the context of Linked Open Data oriented endeavors, resulting in nothing but confusion or inadvertent perpetuation of the overarching pragmatically challenged "Semantic Web" stereotype.

As you can imagine, hitting SPARQL as your introduction to Linked Data is akin to hitting SQL as your introduction to Relational Database Technology, neither is an elevator-style value prop. relay mechanism.

In the relational realm, killer demos always started with desktop productivity tools (spreadsheets, report-writers, SQL QBE tools etc.) accessing, relational data sources en route to unveiling the "Productivity" and "Agility" value prop. that such binding delivered i.e., the desktop application (clients) and the databases (servers) are distinct, but operating in a mutually beneficial manner to all, courtesy of a data access standards such as ODBC (Open Database Connectivity).

In the Linked Data realm, learning to embrace and extend best practices from the relational dbms realm remains a challenge, a lot of this has to do with hangovers from a misguided perception that RDF databases will somehow completely replace RDBMS engines, rather than compliment them. Thus, you have a counter productive variant of NIH (Not Invented Here) in play, taking us to the dreaded realm of: Break the Pot and You Own It (exemplified by the 11+ year Semantic Web Project comprehension and appreciation odyssey).

From my vantage point, here is how I believe the Linked Data value pyramid should be layered, especially when communicating the essential value prop.:

  1. HTTP URLs -- LINKs to documents (Reports) that users already appreciate, across the public Web and/or Intranets
  2. HTTP URIs -- typically not visually distinguishable from the URLs, so use the Data exposed by de-referencing a URL to show how each Data Item (Entity or Object) is uniquely identified by a Generic HTTP URI, and how clicking on the said URIs leads to more structured metadata bearing documents available in a variety of data representation formats, thereby enabling flexible data presentation (e.g., smarter HTML pages)
  3. SPARQL -- when a user appreciates the data representation and presentation dexterity of a Generic HTTP URI, they will be more inclined to drill down an additional layer to unravel how HTTP URIs mechanically deliver such flexibility
  4. RDF Data Stores -- at this stage the user is now interested data sources behind the Generic HTTP URIs, courtesy of natural desire to tweak the data presented in the report; thus, you now have an engaged user ready to absorb the "How Generic HTTP URIs Pull This Off" message
  5. RDF Data Sets -- while attempting to make or tweak HTTP URIs, users become curious about the actual data loaded into the RDF Data Store, which is where data sets used to create powerful Lookup Data Spaces (e.g., DBpedia) come into play such as those from the LOD constellation as exemplified by DBpedia (extractions from Wikipedia).

Related

# PermaLink Comments [0]
01/31/2010 17:44 GMT Modified: 02/01/2010 09:02 GMT
What is the DBpedia Project? (Updated) [ Kingsley Uyi Idehen ]

The recent Wikipedia imbroglio centered around DBpedia is the fundamental driver for this particular blog post. At time of writing this blog post, the DBpedia project definition in Wikipedia remains unsatisfactory due to the following shortcomings:

  1. inaccurate and incomplete definition of the Project's What, Why, Who, Where, When, and How
  2. inaccurate reflection of project essence, by skewing focus towards data extraction and data set dump production, which is at best a quarter of the project.

Here are some insights on DBpedia, from the perspective of someone intimately involved with the other three-quarters of the project.

What is DBpedia?

A live Web accessible RDF model database (Quad Store) derived from Wikipedia content snapshots, taken periodically. The RDF database underlies a Linked Data Space comprised of: HTML (and most recently HTML+RDFa) based data browser pages and a SPARQL endpoint.

Note: DBpedia 3.4 now exists in snapshot (warehouse) and Live Editions (currently being hot-staged). This post is about the snapshot (warehouse) edition, I'll drop a different post about the DBpedia Live Edition where a new Delta-Engine covers both extraction and database record replacement, in realtime.

When was it Created?

As an idea under the moniker "DBpedia" it was conceptualized in late 2006 by researchers at University of Leipzig (lead by Soren Auer) and Freie University, Berlin (lead by Chris Bizer). The first public instance of DBpedia (as described above) was released in February 2007. The official DBpedia coming out party occurred at WWW2007, Banff, during the inaugural Linked Data gathering, where it showcased the virtues and immense potential of TimBL's Linked Data meme.

Who's Behind It?

OpenLink Software (developers of OpenLink Virtuoso and providers of Web Hosting infrastructure), University of Leipzig, and Freie Univerity, Berlin. In addition, there is a burgeoning community of collaborators and contributors responsible DBpedia based applications, cross-linked data sets, ontologies (OpenCyc, SUMO, UMBEL, and YAGO) and other utilities. Finally, DBpedia wouldn't be possible without the global content contribution and curation efforts of Wikipedians, a point typically overlooked (albeit inadvertently).

How is it Constructed?

The steps are as follows:

  1. RDF data set dump preparation via Wikipedia content extraction and transformation to RDF model data, using the N3 data representation format - Java and PHP extraction code produced and maintained by the teams at Leipzig and Berlin
  2. Deployment of Linked Data that enables Data browsing and exploration using any HTTP aware user agent (e.g. basic Web Browsers) - handled by OpenLink Virtuoso (handled by Berlin via the Pubby Linked Data Server during the early months of the DBpedia project)
  3. SPARQL compliant Quad Store, enabling direct access to database records via SPARQL (Query language, REST or SOAP Web Service, plus a variety of query results serialization formats) - OpenLink Virtuoso since first public release of DBpedia

In a nutshell, there are four distinct and vital components to DBpedia. Thus, DBpedia doesn't exist if all the project offered was a collection of RDF data dumps. Likewise, it doesn't exist if you have a SPARQL compliant Quad Store without loaded data sets, and of course it doesn't exist if you have a fully loaded SPARQL compliant Quad Store is up to the cocktail of challenges presented by live Web accessibility.

Why is it Important?

It remains a live exemplar for any individual or organization seeking to publishing or exploit HTTP based Linked Data on the World Wide Web. Its existence continues to stimulate growth in both density and quality of the burgeoning Web of Linked Data.

How Do I Use it?

In the most basic sense, simply browse the HTML pages en route to discovery erstwhile relationships that exist across named entities and subject matter concepts / headings. Beyond that, simply look at DBpedia as a master lookup table in a Web hosted distributed database setup; enabling you to mesh your local domain specific details with DBpedia records via structured relations (triples or 3-tuples records) comprised of HTTP URIs from both realms e.g., owl:sameAs relations.

What Can I Use it For?

Expanding on the Master-Details point above, you can use its rich URI corpus to alleviate tedium associated with activities such as:

  1. List maintenance - e.g., Countries, States, Companies, Units of Measurement, Subject Headings etc.
  2. Tagging - as a compliment to existing practices
  3. Analytical Research - you're only a LINK (URI) away from erstwhile difficult to attain research data spread across a broad range of topics
  4. Closed Vocabulary Construction - rather than commence the futile quest of building your own closed vocabulary, simply leverage Wikipedia's human curated vocabulary as our common base.

Related

# PermaLink Comments [0]
01/31/2010 17:43 GMT Modified: 02/01/2010 09:01 GMT
5 Very Important Things to Note about HTTP based Linked Data [ Kingsley Uyi Idehen ]
  1. It isn't World Wide Web Specific (HTTP != World Wide Web)
  2. It isn't Open Data Specific
  3. It isn't about "Free" (Beer or Speech)
  4. It isn't about Markup (so don't expect to grok it via "markup first" approach)
  5. It's about Hyperdata - the use of HTTP and REST to deliver a powerful platform agnostic mechanism for Data Reference, Access, and Integration.

When trying to understand HTTP based Linked Data, especially if you're well versed in DBMS technology use (User, Power User, Architect, Analyst, DBA, or Programmer) think:

  • Open Database Connectivity (ODBC) without operating system, data model, or wire-protocol specificity or lock-in potential
  • Java Database Connectivity (JDBC) without programming language specificity
  • ADO.NET without .NET runtime specificity and .NET bound language specificity
  • OLE-DB without Windows operating system & programming language specificity
  • XMLA without XML format specificity - with Tabular and Multidimensional results formats expressible in a variety of data representation formats.
  • All of the above scoped to the Record rather than Container level, with Generic HTTP scheme URIs associated with each Record, Field, and Field value (optionally)

Remember the need for Data Access & Integration technology is the by product of the following realities:

  1. Human curated data is ultimately dirty, because:
    • our thick thumbs, inattention, distractions, and general discomfort with typing, make typos prevalent
    • database engines exist for a variety of data models - Graph, Relational, Hierarchical;
    • within databases you have different record container/partition names e.g. Table Names;
    • within a database record container you have records that are really aspects of the same thing (different keys exist in a plethora of operational / line of business systems that expose aspects of the same entity e.g., customer data that spans Accounts, CRM, ERP application databases);
    • different field names (one database has "EMP" while another has "Employee") for the same record
    • .
  2. Units of measurement is driven by locale, the UK office wants to see sales in Pounds Sterling while the French office prefers Euros etc.
  3. All of the above is subject to context halos which can be quite granular re. sensitivity e.g. staff travel between locations that alter locales and their roles; basically, profiles matters a lot.

Related

# PermaLink Comments [0]
01/31/2010 17:31 GMT Modified: 02/01/2010 09:00 GMT
5 Game Changing Things about the OpenLink Virtuoso + AWS Cloud Combo [ Kingsley Uyi Idehen ]

Here are 5 powerful benefits you can immediately derive from the combination of Virtuoso and Amazon's AWS services (specifically the EC2 and EBS components):

  1. Acquire your own personal or service specific data space in the Cloud. Think DBase, Paradox, FoxPRO, Access of yore, but with the power of Oracle, Informix, Microsoft SQL Server etc.. using a Conceptual, as opposed to solely Logical, model based DBMS (i.e., a Hybrid DBMS Engine for: SQL, RDF, XML, and Full Text)
  2. Ability to share and control access to your resources using innovations like FOAF+SSL, OpenID, and OAuth, all from one place
  3. Construction of personal or organization based FOAF profiles in a matter of minutes; by simply creating a basic DBMS (or ODS application layer) account; and then using this profile to create strong links (references) to all your Data silos (esp. those from the Web 2.0 realm)
  4. Load data sets from the LOD cloud or Sponge existing Web resources (i.e., on the fly data transformation to RDF model based Linked Data) and then use the combination to build powerful lookup services that enrich the value of URLs (think: Web addressable reports holding query results) that you publish
  5. Bind all of the above to a domain that you own (e.g. a .Name domain) so that you have an attribution-friendly "authority" component for resource URLs and Entity URIs published from your Personal Linked Data Space on the Web (or private HTTP network).

In a nutshell, the AWS Cloud infrastructure simplifies the process of generating Federated presence on the Internet and/or World Wide Web. Remember, centralized networking models always end up creating data silos, in some context, ultimately! :-)

# PermaLink Comments [0]
01/31/2010 17:29 GMT Modified: 02/01/2010 08:59 GMT
 <<     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |     >>
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform