Details

OpenLink Software
Burlington, United States

Subscribe

Post Categories

Recent Articles

Community Member Blogs

Display Settings

articles per page.
order.

Translate

Virtuoso Linked Data Deployment In 3 Simple Steps [ Kingsley Uyi Idehen ]

Injecting Linked Data into the Web has been a major pain point for those who seek personal, service, or organization-specific variants of DBpedia. Basically, the sequence goes something like this:

  1. You encounter DBpedia or the LOD Cloud Pictorial.
  2. You look around (typically following your nose from link to link).
  3. You attempt to publish your own stuff.
  4. You get stuck.

The problems typically take the following form:

  1. Functionality confusion about the complementary Name and Address functionality of a single URI abstraction
  2. Terminology confusion due to conflation and over-loading of terms such as Resource, URL, Representation, Document, etc.
  3. Inability to find robust tools with which to generate Linked Data from existing data sources such as relational databases, CSV files, XML, Web Services, etc.

To start addressing these problems, here is a simple guide for generating and publishing Linked Data using Virtuoso.

Step 1 - RDF Data Generation

Existing RDF data can be added to the Virtuoso RDF Quad Store via a variety of built-in data loader utilities.

Many options allow you to easily and quickly generate RDF data from other data sources:

  • Install the Sponger Bookmarklet for the URIBurner service. Bind this to your own SPARQL-compliant backend RDF database (in this scenario, your local Virtuoso instance), and then Sponge some HTTP-accessible resources.
  • Convert relational DBMS data to RDF using the Virtuoso RDF Views Wizard.
  • Starting with CSV files, you can
    • Place them at an HTTP-accessible location, and use the Virtuoso Sponger to convert them to RDF or;
    • Use the CVS import feature to import their content into Virtuoso's relational data engine; then use the built-in RDF Views Wizard as with other RDBMS data.
  • Starting from XML files, you can
    • Use Virtuoso's inbuilt XSLT-Processor for manual XML to RDF/XML transformation or;
    • Leverage the Sponger Cartridge for GRDDL, if there is a transformation service associated with your XML data source, or;
    • Let the Sponger analyze the XML data source and make a best-effort transformation to RDF.

Step 2 - Linked Data Deployment

Install the Faceted Browser VAD package (fct_dav.vad) which delivers the following:

  1. Faceted Browser Engine UI
  2. Dynamic Hypermedia Resource Generator
    • delivers descriptor resources for every entity (data object) in the Native or Virtual Quad Stores
    • supports a broad array of output formats, including HTML+RDFa, RDF/XML, N3/Turtle, NTriples, RDF-JSON, OData+Atom, and OData+JSON.

Step 3 - Linked Data Consumption & Exploitation

Three simple steps allow you, your enterprise, and your customers to consume and exploit your newly deployed Linked Data --

  1. Load a page like this in your browser: http://<cname>[:<port>]/describe/?uri=<entity-uri>
    • <cname>[:<port>] gets replaced by the host and port of your Virtuoso instance
    • <entity-uri> gets replaced by the URI you want to see described -- for instance, the URI of one of the resources you let the Sponger handle.
  2. Follow the links presented in the descriptor page.
  3. If you ever see a blank page with a hyperlink subject name in the About: section at the top of the page, simply add the parameter "&sp=1" to the URL in the browser's Address box, and hit [ENTER]. This will result in an "on the fly" resource retrieval, transformation, and descriptor page generation.
  4. Use the navigator controls to page up and down the data associated with the "in scope" resource descriptor.

Related

# PermaLink Comments [0]
10/29/2010 18:54 GMT Modified: 11/02/2010 11:55 GMT
VLDB Semdata Workshop [ Virtuoso Data Space Bot ]

I will begin by extending my thanks to the organizers, in specific Reto Krummenacher of STI and Atanas Kiryakov of Ontotext for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontifs (pontifex) amongst us who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned.

Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.

I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.

One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?

The answer is that if one accesses a lot of these literals the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now taking the worst abuse of SPARQL, a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.

Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.

Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogenous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.

Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.

With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.

Of the academic papers, the SpiderStore (paper is not yet available at time of writing, but should be soon) and Webpie that should be specially noted.

Let us talk about SpiderStore first.

SpiderStore

The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.

According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.

This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple, plus should one add a graph to the mix one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.

We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.

But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.

SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, specially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.

We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.

If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.

Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.

Webpie

Webpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.

Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.

The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.

We have suggested such an experiment to the LarKC people. We will see.

# PermaLink Comments [0]
09/21/2010 17:14 GMT Modified: 09/21/2010 16:22 GMT
VLDB Semdata Workshop [ Orri Erling ]

I will begin by extending my thanks to the organizers, in specific Reto Krummenacher of STI and Atanas Kiryakov of Ontotext for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontifs (pontifex) amongst us who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned.

Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.

I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.

One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?

The answer is that if one accesses a lot of these literals the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now taking the worst abuse of SPARQL, a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.

Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.

Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogenous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.

Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.

With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.

Of the academic papers, the SpiderStore (paper is not yet available at time of writing, but should be soon) and Webpie that should be specially noted.

Let us talk about SpiderStore first.

SpiderStore

The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.

According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.

This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple, plus should one add a graph to the mix one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.

We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.

But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.

SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, specially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.

We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.

If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.

Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.

Webpie

Webpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.

Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.

The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.

We have suggested such an experiment to the LarKC people. We will see.

# PermaLink Comments [0]
09/21/2010 17:14 GMT Modified: 09/21/2010 16:22 GMT
Simple Compare & Contrast of Web 1.0, 2.0, and 3.0 (Update 1) [ Kingsley Uyi Idehen ]

Here is a tabulated "compare and contrast" of Web usage patterns 1.0, 2.0, and 3.0.

  Web 1.0 Web 2.0 Web 3.0
Simple Definition Interactive / Visual Web Programmable Web Linked Data Web
Unit of Presence Web Page Web Service Endpoint Data Space (named structured data enclave)
Unit of Value Exchange Page URL Endpoint URL for API Resource / Entity / Object URI
Data Granularity Low (HTML) Medium (XML) High (RDF)
Defining Services Search Community (Blogs to Social Networks) Find
Participation Quotient Low Medium High
Serendipitous Discovery Quotient Low Medium High
Data Referencability Quotient Low (Documents) Medium (Documents) High (Documents and their constituent Data)
Subjectivity Quotient High Medium (from A-list bloggers to select source and partner lists) Low (everything is discovered via URIs)
Transclusence Low Medium (Code driven Mashups) HIgh (Data driven Meshups)
What You See Is What You Prefer (WYSIWYP) Low Medium High (negotiated representation of resource descriptions)
Open Data Access (Data Accessibility) Low Medium (Silos) High (no Silos)
Identity Issues Handling Low Medium (OpenID)

High (FOAF+SSL)

Solution Deployment Model Centralized Centralized with sprinklings of Federation Federated with function specific Centralization (e.g. Lookup hubs like LOD Cloud or DBpedia)
Data Model Orientation Logical (Tree based DOM) Logical (Tree based XML) Conceptual (Graph based RDF)
User Interface Issues Dynamically generated static interfaces Dyanically generated interafaces with semi-dynamic interfaces (courtesy of XSLT or XQuery/XPath) Dynamic Interfaces (pre- and post-generation) courtesy of self-describing nature of RDF
Data Querying Full Text Search Full Text Search Full Text Search + Structured Graph Pattern Query Language (SPARQL)
What Each Delivers Democratized Publishing Democratized Journalism & Commentary (Citizen Journalists & Commentators) Democratized Analysis (Citizen Data Analysts)
Star Wars Edition Analogy Star Wars (original fight for decentralization via rebellion) Empire Strikes Back (centralization and data silos make comeback) Return of the JEDI (FORCE emerges and facilitates decentralization from "Identity" all the way to "Open Data Access" and "Negotiable Descriptive Data Representation")

Naturally, I am not expecting everyone to agree with me. I am simply making my contribution to what will remain facinating discourse for a long time to come :-)

Related

# PermaLink Comments [1]
03/14/2009 14:20 GMT Modified: 04/29/2009 13:21 GMT
Introducing Virtuoso Universal Server (Cloud Edition) for Amazon EC2 [ Kingsley Uyi Idehen ]

What is it?

A pre-installed edition of Virtuoso for Amazon's EC2 Cloud platform.

What does it offer?

From a Web Entrepreneur perspective it offers:
  1. Low cost entry point to a game-changing Web 3.0+ (and beyond) platform that combines SQL, RDF, XML, and Web Services functionality
  2. Flexible variable cost model (courtesy of EC2 DevPay) tightly bound to revenue generated by your services
  3. Delivers federated and/or centralized model flexibility for you SaaS based solutions
  4. Simple entry point for developing and deploying sophisticated database driven applications (SQL or RDF Linked Data Web oriented)
  5. Complete framework for exploiting OpenID, OAuth (including Role enhancements) that simplifies exploitation of these vital Identity and Data Access technologies
  6. Easily implement RDF Linked Data based Mail, Blogging, Wikis, Bookmarks, Calendaring, Discussion Forums, Tagging, Social-Networking as Data Space (data containers) features of your application or service offering
  7. Instant alleviation of challenges (e.g. service costs and agility) associated with Data Portability and Open Data Access across Web 2.0 data silos
  8. LDAP integration for Intranet / Extranet style applications.

From the DBMS engine perspective it provides you with one or more pre-configured instances of Virtuoso that enable immediate exploitation of the following services:

  1. RDF Database (a Quad Store with SPARQL & SPARUL Language & Protocol support)
  2. SQL Database (with ODBC, JDBC, OLE-DB, ADO.NET, and XMLA driver access)
  3. XML Database (XML Schema, XQuery/Xpath, XSLT, Full Text Indexing)
  4. Full Text Indexing.

From a Middleware perspective it provides:

  1. RDF Views (Wrappers / Semantic Covers) over SQL, XML, and other data sources accessible via SOAP or REST style Web Services
  2. Sponger Service for converting non RDF information resources into RDF Linked Data "on the fly" via a large collection of pre-installed RDFizer Cartridges.

From the Web Server Platform perspective it provides an alternative to LAMP stack components such as MySQL and Apace by offering

  1. HTTP Web Server
  2. WebDAV Server
  3. Web Application Server (includes PHP runtime hosting)
  4. SOAP or REST style Web Services Deployment
  5. RDF Linked Data Deployment
  6. SPARQL (SPARQL Query Language) and SPARUL (SPARQL Update Language) endpoints
  7. Virtuoso Hosted PHP packages for MediaWiki, Drupal, Wordpress, and phpBB3 (just install the relevant Virtuoso Distro. Package).

From the general System Administrator's perspective it provides:

  1. Online Backups (Backup Set dispatched to S3 buckets, FTP, or HTTP/WebDAV server locations)
  2. Synchronized Incremental Backups to Backup Set locations
  3. Backup Restore from Backup Set location (without exiting to EC2 shell).

Higher level user oriented offerings include:

  1. OpenLink Data Explorer front-end for exploring the burgeoning Linked Data Web
  2. Ajax based SPARQL Query Builder (iSPARQL) that enables SPARQL Query construction by Example
  3. Ajax based SQL Query Builder (QBE) that enables SQL Query construction by Example.

For Web 2.0 / 3.0 users, developers, and entrepreneurs it offers it includes Distributed Collaboration Tools & Social Media realm functionality courtesy of ODS that includes:

  1. Point of presence on the Linked Data Web that meshes your Identity and your Data via URIs
  2. System generated Social Network Profile & Contact Data via FOAF?
  3. System generated SIOC (Semantically Interconnected Online Community) Data Space (that includes a Social Graph) exposing all your Web data in RDF Linked Data form
  4. System generated OpenID and automatic integration with FOAF
  5. Transparent Data Integration across Facebook, Digg, LinkedIn, FriendFeed, Twitter, and any other Web 2.0 data space equipped with RSS / Atom support and/or REST style Web Services
  6. In-built support for SyncML which enables data synchronization with Mobile Phones.

How Do I Get Going with It?

# PermaLink Comments [0]
11/28/2008 19:27 GMT Modified: 11/28/2008 16:06 GMT
ISWC 2008: RDB2RDF Face-to-Face [ Virtuoso Data Space Bot ]

The W3C's RDB-to-RDF mapping incubator group (RDB2RDF XG) met in Karlsruhe after ISWC 2008.

The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time.

Whose Identifiers?

There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM's reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.

Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with "same as" assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.

XML or RDF?

We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.

Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential.

Interoperability

We agreed with Sören Auer that we could make Virtuoso's mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.

There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.

Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.

Related Web resources

# PermaLink Comments [0]
11/04/2008 13:26 GMT Modified: 11/04/2008 17:20 GMT
ISWC 2008: RDB2RDF Face-to-Face [ Orri Erling ]

The W3C's RDB-to-RDF mapping incubator group (RDB2RDF XG) met in Karlsruhe after ISWC 2008.

The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from SPARQL to SQL. There was a lot of agreement and the meeting even finished ahead of the allotted time.

Whose Identifiers?

There was discussion concerning using the Entity Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM's reference cannot depend on any registration. Also, even when there is a stable URI for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.

Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with "same as" assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.

XML or RDF?

We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as OWL ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has XPath in string form embedded into XML. Maybe it will end up in this way here also. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the information schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.

Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of data, such a description of the exposed data is essential.

Interoperability

We agreed with Sören Auer that we could make Virtuoso's mapping language compatible with Triplify. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.

There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.

Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.

Related Web resources

# PermaLink Comments [0]
11/04/2008 13:26 GMT Modified: 11/04/2008 17:20 GMT
State of the Semantic Web, Part 2 - The Technical Questions (updated) [ Virtuoso Data Space Bot ]

Here I will talk about some more technical questions that came up. This is mostly general; Virtuoso specific questions and answers are separate.

"How to Bootstrap? Where will the triples come from?"

There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern.

The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources.

"How will one develop applications?"

I'd say one will develop them much the same way as thus far. In PHP, for example. Whether one's query language is SPARQL or SQL does not make a large difference in how basic web UI is made.

A SPARQL end-point is no more an end-user item than a SQL command-line is.

A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.

The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic. What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good.

The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? "Don't mode me in" was the slogan, as I recall.

Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.

For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.

For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces.

The OpenLink Data Explorer idea of a "data behind the web page" view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this.

"What of the proliferation of standards? Does this not look too tangled, no clear identity? How would one know where to begin?"

When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure. Just the right amount of structure for the application, less rigid and easier to change than relational.

Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure — HTTP, URIs, and XML where desired.

It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired.

"Can I make a social network application in RDF only, with no RDBMS?"

Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students and anyway they do not hang online browsing the course catalogue.

On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with a RDBMS.

If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently.

In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.

# PermaLink Comments [0]
10/26/2008 12:02 GMT Modified: 10/27/2008 11:28 GMT
State of the Semantic Web, Part 2 - The Technical Questions (updated) [ Orri Erling ]

Here I will talk about some more technical questions that came up. This is mostly general; Virtuoso specific questions and answers are separate.

"How to Bootstrap? Where will the triples come from?"

There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern.

The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources.

"How will one develop applications?"

I'd say one will develop them much the same way as thus far. In PHP, for example. Whether one's query language is SPARQL or SQL does not make a large difference in how basic web UI is made.

A SPARQL end-point is no more an end-user item than a SQL command-line is.

A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.

The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic. What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good.

The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? "Don't mode me in" was the slogan, as I recall.

Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.

For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.

For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces.

The OpenLink Data Explorer idea of a "data behind the web page" view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this.

"What of the proliferation of standards? Does this not look too tangled, no clear identity? How would one know where to begin?"

When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure. Just the right amount of structure for the application, less rigid and easier to change than relational.

Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure — HTTP, URIs, and XML where desired.

It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired.

"Can I make a social network application in RDF only, with no RDBMS?"

Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students and anyway they do not hang online browsing the course catalogue.

On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with a RDBMS.

If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently.

In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.

# PermaLink Comments [0]
10/26/2008 12:02 GMT Modified: 10/27/2008 11:28 GMT
Time for Context Lenses (Update) [ Kingsley Uyi Idehen ]
As the Linked Data meme continues on it's quest to unravel the mysteries of the Semantic Web vision, it's quite gratifying to see that data virtualization comprehension: creating "Conceptual Views" into logically organized "Disparate & Heterogeneous Data Sources" via "Context Lenses" is taking shape, as illustrated in the "note-to-self" post by David Provost.




Virtualization of heterogeneous data sources is only achievable if you have a dexterous data model based "Bus" into which the data sources are plugged. RDF has offered such a model for a long time.



Image



When heterogeneous data sources are plugged into an RDF based integration bus e.g., customer records sourced from a variety of tables, across a plethora of databases, you can only end up with true value if the emergent entities from such an effort are coherently linked and (de)referencable; which is what Linked Data's fundamental preoccupation with dereferencable URIs is all about. Of course, Even when you have all of the above in place, you also need to be able to construct "Context Lenses" i.e., context driven views of the Linked Data Mesh (or Linked Data Spaces).


Additional Diagrams:


1. Clients of the RDF Bus
2. RDF Bus Server plugins: Scripts that emit RDF
3. RDF Bus Servers: RDF Data Managers (Triple or Quad Stores)
4. RDF Bus Servers: Relational to RDF Mappers (RDF Views, Semantic Covers etc.)
5. RDF Bus Server plugins: XML to RDF Mappers
6. RDF Bus Server plugins: GRDDL based XSLT stylesheets that emit RDF
7. RDF Bus Server plugins: Intelligent RDF Middleware






# PermaLink Comments [0]
08/02/2008 19:06 GMT Modified: 08/04/2008 11:24 GMT
 <<     | 1 | 2 | 3 |     >>
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform