Details

OpenLink Software
Burlington, United States

Subscribe

Post Categories

Recent Articles

Community Member Blogs

Display Settings

articles per page.
order.

Translate

Showing posts in all categories RefreshRefresh
Short Recap of Virtuoso Basics (#3 of 5) [ Orri Erling ]

(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.

Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.

If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application specific data structures as RDF applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can federate tables from heterogenous DBMS.

On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.

Installation is simple, just one exe and one config file. There is a huge amount of code in installers — application code and test suites and such — but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.

Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.

To condense further:

  • Scalable Delivery of Linked Data
  • SPARQL and SQL
    • Arbitrary RDF Data + Relational
    • Also From 3rd Party RDBMS
  • Easy Deployment
  • Standard Interfaces
# PermaLink Comments [0]
04/30/2009 11:49 GMT Modified: 04/30/2009 12:11 GMT
Short Recap of Virtuoso Basics (#3 of 5) [ Virtuso Data Space Bot ]

(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.

Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.

If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application specific data structures as RDF applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can federate tables from heterogenous DBMS.

On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.

Installation is simple, just one exe and one config file. There is a huge amount of code in installers — application code and test suites and such — but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.

Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.

To condense further:

  • Scalable Delivery of Linked Data
  • SPARQL and SQL
    • Arbitrary RDF Data + Relational
    • Also From 3rd Party RDBMS
  • Easy Deployment
  • Standard Interfaces
# PermaLink Comments [0]
04/30/2009 11:49 GMT Modified: 04/30/2009 12:11 GMT
Search at WWW 2009 (#2 of 5) [ Orri Erling ]

(Second of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

There was a workshop on semantic search plus a number of papers and of course keynotes from Google and Yahoo.

A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Use of query logs and click through of search results for improved ranking was mentioned many times throughout the conference.

The semantic search workshop was largely about benchmarks for keyword search in information retrieval. For linked data, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.

Giovanni Tummarello presented Sig.ma, a service using Sindice's RDF index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Further speculating, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources into our browser too. Fresnel lenses are another thing to look at.

There was a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community based ranking model but has issues of privacy. Another point was that with local processing and personal scale data volumes various kinds of brute force processing were feasible that would cost a lot for the web scale. Much can be done web scale but it must be done cleverly, not with a shell script and not so ad hoc.

As a counterpoint to this, there was a talk about Hadoop and Hive, a map-reduce-based SQL-like framework. One could do an SQL GROUP BY on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is, if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing becomes important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their critique of map reduce. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.

Some of our future plans were confirmed by what we saw, for example as concerns:

  • Interactively selecting sources for search, showing the graphs, then interactively refining
  • More social networks, more network analysis, and more work on social recommendation
  • Real time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting micro-formats from results
  • Using entity extraction

These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others, details will be worked out while we work on the items we implement by ourselves.

# PermaLink Comments [0]
04/30/2009 11:18 GMT Modified: 04/30/2009 12:51 GMT
Search at WWW 2009 (#2 of 5) [ Virtuso Data Space Bot ]

(Second of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

There was a workshop on semantic search plus a number of papers and of course keynotes from Google and Yahoo.

A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Use of query logs and click through of search results for improved ranking was mentioned many times throughout the conference.

The semantic search workshop was largely about benchmarks for keyword search in information retrieval. For linked data, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.

Giovanni Tummarello presented Sig.ma, a service using Sindice's RDF index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Further speculating, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources into our browser too. Fresnel lenses are another thing to look at.

There was a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community based ranking model but has issues of privacy. Another point was that with local processing and personal scale data volumes various kinds of brute force processing were feasible that would cost a lot for the web scale. Much can be done web scale but it must be done cleverly, not with a shell script and not so ad hoc.

As a counterpoint to this, there was a talk about Hadoop and Hive, a map-reduce-based SQL-like framework. One could do an SQL GROUP BY on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is, if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing becomes important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their critique of map reduce. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.

Some of our future plans were confirmed by what we saw, for example as concerns:

  • Interactively selecting sources for search, showing the graphs, then interactively refining
  • More social networks, more network analysis, and more work on social recommendation
  • Real time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting micro-formats from results
  • Using entity extraction

These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others, details will be worked out while we work on the items we implement by ourselves.

# PermaLink Comments [0]
04/30/2009 11:18 GMT Modified: 04/30/2009 12:51 GMT
Linked Data at WWW 2009 (#1 of 5) [ Orri Erling ]

(First of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.

The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF.

To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.

People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.

Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.

We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code with special parallel programming models, can now be done in SQL and SPARQL.

To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider.

Entry costs into relatively high end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today's retail prices and fits under a desk. For intermittent use, the rent for 1TB RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. Like Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.

A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.

I answered as follows, which apparently cannot be repeated too much:

First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for database, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogenous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.

Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.

There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated "on the fly" as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0's built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entity LOD Cloud. Virtuoso 6.0 (Open Source Edition) "TP1" is now publicly available as a Technology Preview (beta).

We heard that there is an effort for porting Freebase's Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.

As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is "virtualized" into the database cloud or the local secure server, as the use case may require.

For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG.

Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.

For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.

# PermaLink Comments [0]
04/27/2009 17:28 GMT Modified: 04/28/2009 11:27 GMT
Linked Data at WWW 2009 (#1 of 5) [ Virtuso Data Space Bot ]

(First of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.

The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF.

To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.

People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.

Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.

We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code with special parallel programming models, can now be done in SQL and SPARQL.

To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider.

Entry costs into relatively high end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today's retail prices and fits under a desk. For intermittent use, the rent for 1TB RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. Like Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.

A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.

I answered as follows, which apparently cannot be repeated too much:

First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for database, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogenous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.

Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.

There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated "on the fly" as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0's built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entity LOD Cloud. Virtuoso 6.0 (Open Source Edition) "TP1" is now publicly available as a Technology Preview (beta).

We heard that there is an effort for porting Freebase's Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.

As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is "virtualized" into the database cloud or the local secure server, as the use case may require.

For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG.

Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.

For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.

# PermaLink Comments [0]
04/27/2009 17:28 GMT Modified: 04/28/2009 11:27 GMT
Beyond Applications - Introducing the Planetary Datasphere (Part 1) [ Orri Erling ]

This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets.

The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.

This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.

I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.

General

There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.

These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.

The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.

In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.

The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network — horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.

The place of the DBMS in the stack changes

The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.

For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.

Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.

End User Benefits

For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.

The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user's interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.

Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.

Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.

Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.

Applications in the DataSphere

The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.

Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.

For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., DBpedia Vs. Freebase identifiers). The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.

The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.

Advertising

The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.

Content Providers, License, and Attribution

For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.

Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.

One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.

Next we will look at the service provider and infrastructure side of this.

Related

# PermaLink Comments [0]
03/24/2009 09:38 GMT Modified: 03/24/2009 10:50 GMT
Beyond Applications - Introducing the Planetary Datasphere (Part 1) [ Virtuso Data Space Bot ]

This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets.

The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.

This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.

I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.

General

There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.

These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.

The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.

In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.

The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network — horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.

The place of the DBMS in the stack changes

The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.

For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.

Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.

End User Benefits

For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.

The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user's interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.

Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.

Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.

Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.

Applications in the DataSphere

The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.

Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.

For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., DBpedia Vs. Freebase identifiers). The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.

The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.

Advertising

The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.

Content Providers, License, and Attribution

For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.

Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.

One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.

Next we will look at the service provider and infrastructure side of this.

Related

# PermaLink Comments [0]
03/24/2009 09:38 GMT Modified: 03/24/2009 10:50 GMT
Linked Data & The Year 2009 (updated) [ Orri Erling ]

As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.

Sir Tim said it at WWW08 in Beijinglinked data and the linked data web is the semantic web and the Web done right.

The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.

The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft's Entity Frameworks technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service based business model and should interoperate and integrate with everything else — business needs first; schema last.

Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is none the less important, on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.

It is against this backdrop that this year will play out.

As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].

Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.

I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.

Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?

The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can't do unless you own the RDBMS. Source access is not enough; you also need the knowledge.

Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.

Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.

Linking Open Data project logo
In hoc signo vinces.

In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.

For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.

For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.

For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.

For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.

2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.

Exciting 2009 to all.

# PermaLink Comments [0]
01/02/2009 16:17 GMT Modified: 01/02/2009 13:26 GMT
Linked Data & The Year 2009 (updated) [ Virtuso Data Space Bot ]

As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.

Sir Tim said it at WWW08 in Beijinglinked data and the linked data web is the semantic web and the Web done right.

The grail of ad hoc analytics on infinite data has lost none of its appeal. We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.

The benefits of a data model more abstract than the relational are being increasingly appreciated also outside the data web circles. Microsoft's Entity Frameworks technology is an example. Agility has been a buzzword for a long time. Everything should be offered in a service based business model and should interoperate and integrate with everything else — business needs first; schema last.

Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized. Information, as the asset it is, is none the less important, on the contrary. But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.

It is against this backdrop that this year will play out.

As concerns research, I will again quote Harry Halpin at ESWC 2008: "Men will fight in a war, and even lose a war, for what they believe just. And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality" [or words to this effect].

Something like the data web, and even the semantic web, will happen. Harry's question was whether this would be the descendant of what is today called semantic web research.

I heard in conversation about a project for making a very large metadata store. I also heard that the makers did not particularly insist on this being RDF-based, though.

Why should such a thing be RDF-based? If it is already accepted that there will be ad hoc schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?

The justification of RDF is in reusing and linking-to data and terminology out there. Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an entity-attribute-value (EAV, i.e., triple) store on a generic RDBMS. The sem-web world has been there, trust me. We came out well because we put all inside the RDBMS, lowest level, which you can't do unless you own the RDBMS. Source access is not enough; you also need the knowledge.

Technicalities aside, the question is one of proprietary vs. standards-based. This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. Zemanta and OpenCalais serving DBpedia URIs are examples. Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.

Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata). As on the web, so on the enterprise intranet. In this lies the strength of RDF as opposed to proprietary flexible database schemes. This is a qualitative distinction.

Linking Open Data project logo
In hoc signo vinces.

In this light, we welcome the voiD (VOcabulary of Interlinked Data), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.

For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace. XML is for the transaction; RDF is for the discovery, query, and analytics. As with databases in general, first there was the transaction; then there was the query. Same here. For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota. For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the Virtuoso Anytime feature. With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage. Of course, we do not forget advertising. When data has explicit semantics, contextuality is better than with keywords.

For these visions to materialize on top of the linked data platform, linked data must join the world of data. This means messaging that is geared towards the database public. They know the problem, but the RDF proposition is still not well enough understood for it to connect.

For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping. We are also bringing out new Microsoft Entity Framework components. This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.

For OpenLink Software, 2008 was about developing technology for scale, RDF as well as generic relational. We did show a tiny preview with the Billion Triples Challenge demo. Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale. We started offering ready-to-go Virtuoso-hosted linked open data sets on Amazon EC2 in December. Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available. Technical specifics are amply discussed on this blog. There are still some new technology things to be developed this year; first among these are strong SPARQL federation, and on-the-fly resizing of server clusters. On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI's Líon project. These will provide platforms for further demonstrating the "web" in data web, as in web-scale smart databasing.

2009 will see change through scale. The things that exist will start interconnecting and there will be emergent value. Deployments will be larger and scale will be readily available through a services model or by installation at one's own facilities. We may see the start of Search becoming Find, like Kingsley says, meaning semantics of data guiding search. Entity extraction will multiply data volumes and bring parts of the data web to real time.

Exciting 2009 to all.

# PermaLink Comments [0]
01/02/2009 16:17 GMT Modified: 01/02/2009 13:26 GMT
 <<     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |     >>
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform