<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>

<title>Orri Erling&#39;s Weblog</title><link>http://www.openlinksw.com/weblog/oerling/</link><description /><managingEditor>oerling@openlinksw.com</managingEditor><pubDate>Mon, 23 Nov 2009 11:22:09 GMT</pubDate><generator>Virtuoso Universal Server 05.12.3041</generator><webMaster>oerling@openlinksw.com</webMaster><image><title>Orri Erling&#39;s Weblog</title><url>http://www.openlinksw.com/weblog/public/images/vbloglogo.gif</url><link>http://www.openlinksw.com/weblog/oerling/</link><description /><width>88</width><height>31</height></image>
<item><title>Short Recap of Virtuoso Basics (#3 of 5)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1550</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1550#comments</comments><pubDate>Thu, 30 Apr 2009 15:49:53 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-04-30T12:11:43.000001-04:00</n0:modified><description>&lt;p&gt;(Third of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0x1081fe40&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.)

&lt;/p&gt;
&lt;p&gt;Some points came up in conversation at WWW 2009 that I will reiterate here. We find the product image is still not entirely clear, so I will condense it.&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xd0e85f0&quot;&gt;Virtuoso&lt;/a&gt; is a DBMS. We pitch it primarily to the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x14a294d8&quot;&gt;data&lt;/a&gt; web space because this is where we see the emerging frontier. Virtuoso does both &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x108042f8&quot;&gt;SQL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x10889878&quot;&gt;SPARQL&lt;/a&gt; and can do both at large scale with high performance. The popular perception of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x107d3b40&quot;&gt;RDF&lt;/a&gt; and relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is make all the RDF specifics, such as IRIs and typed literals, native SQL types, and have a cost-based optimizer that knows about all of this.&lt;/p&gt;

&lt;p&gt;If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews&quot; id=&quot;link-id14ddc7c8&quot;&gt;Rendering application-specific data structures as RDF&lt;/a&gt; applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/qsvdbsrv.html&quot; id=&quot;link-id14aaea70&quot;&gt;federate tables from heterogeneous DBMS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On top of this, there is a &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/qswebserver.html&quot; id=&quot;link-id16fcde60&quot;&gt;web server built in&lt;/a&gt;, so that no extra server is needed for web services, web pages, and the like.&lt;/p&gt;

&lt;p&gt;Installation is simple, just one exe and one config file. There is a huge amount of code in &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/installation.html&quot; id=&quot;link-id16767b40&quot;&gt;installers&lt;/a&gt; (application code, test suites, and such), but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.&lt;/p&gt;

&lt;p&gt;Clusters (coming in Release 6) and SQL federation are &lt;a href=&quot;http://download.openlinksw.com/download/product_matrix.vsp?p=l_os&amp;amp;c=39&amp;amp;df=16&quot; id=&quot;link-id16722550&quot;&gt;commercial only&lt;/a&gt;; the rest can be had &lt;a href=&quot;http://sourceforge.net/project/showfiles.php?group_id=161622&quot; id=&quot;link-id131080a8&quot;&gt;under GPL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To condense further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalable Delivery of &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x12211da8&quot;&gt;Linked Data&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SPARQL and SQL
&lt;ul&gt;
    &lt;li&gt;Arbitrary RDF Data + Relational&lt;/li&gt;
&lt;li&gt;Also From 3rd Party &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x168db0e0&quot;&gt;RDBMS&lt;/a&gt;
    &lt;/li&gt;
  &lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Easy Deployment &lt;/li&gt;
&lt;li&gt;Standard Interfaces
&lt;ul&gt;
    &lt;li&gt;
      &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x10473bf0&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x12187f58&quot;&gt;JDBC&lt;/a&gt;, OLE DB, &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x10354e48&quot;&gt;ADO&lt;/a&gt;.&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x16eeadd0&quot;&gt;NET&lt;/a&gt;, XMLA&lt;/li&gt;
&lt;li&gt;
      &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x12e3fe08&quot;&gt;Jena&lt;/a&gt;, &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x15e62470&quot;&gt;Sesame&lt;/a&gt;, etc.&lt;/li&gt;
&lt;li&gt;All Web Protocols &lt;/li&gt;
  &lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description></item><item><title>Search at WWW 2009 (#2 of 5)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1548</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1548#comments</comments><pubDate>Thu, 30 Apr 2009 15:18:24 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-04-30T12:51:48-04:00</n0:modified><description>&lt;p&gt;(Second of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id124024c8&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.)

&lt;/p&gt;
&lt;p&gt;There was a &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/paper/109/html&quot; id=&quot;link-id1207a3b0&quot;&gt;workshop on semantic search&lt;/a&gt; plus &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/html&quot; id=&quot;link-id1704ff48&quot;&gt;a number of papers&lt;/a&gt; and of course &lt;a href=&quot;http://www2009.org/keynote.html&quot; id=&quot;link-id11ec08d8&quot;&gt;keynotes from Google and Yahoo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Using query logs and click-through on search results to improve ranking was mentioned many times throughout the conference.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/paper/109/html&quot; id=&quot;link-id120b7d38&quot;&gt;semantic search workshop&lt;/a&gt; was largely about benchmarks for keyword search in &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id171e2950&quot;&gt;information&lt;/a&gt; retrieval. For &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id11e1a9b0&quot;&gt;linked data&lt;/a&gt;, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://g1o.net/foaf.rdf#me&quot; id=&quot;link-id11d1c7b0&quot;&gt;Giovanni Tummarello&lt;/a&gt; &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/paper/59/html&quot; id=&quot;link-id169add28&quot;&gt;presented&lt;/a&gt; &lt;a href=&quot;http://sig.ma/&quot; id=&quot;link-id11af0128&quot;&gt;Sig.ma&lt;/a&gt;, a service using &lt;a href=&quot;http://sindice.com/&quot; id=&quot;link-id11a69fa0&quot;&gt;Sindice&lt;/a&gt;&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id11f3a088&quot;&gt;RDF&lt;/a&gt; index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Speculating further, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources in our browser too. Fresnel lenses are another thing to look at.&lt;/p&gt;

&lt;p&gt;There was &lt;a href=&quot;http://www2009.eprints.org/242/&quot; id=&quot;link-id11dc7c68&quot;&gt;a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research&lt;/a&gt;, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community-based ranking model but has issues of privacy. Another point was that with local processing and personal-scale &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id171a8948&quot;&gt;data&lt;/a&gt; volumes, various kinds of brute-force processing become feasible that would be very costly at web scale. Much can be done at web scale, but it must be done cleverly, not with a shell script and not so ad hoc.&lt;/p&gt;

&lt;p&gt;As a counterpoint to this, there was &lt;a href=&quot;http://www2009.eprints.org/220/&quot; id=&quot;link-id120bf9e0&quot;&gt;a talk about Hadoop and Hive&lt;/a&gt;, a map-reduce-based &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-idee5d700&quot;&gt;SQL&lt;/a&gt;-like framework. One could do an SQL &lt;code&gt;GROUP BY&lt;/code&gt; on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is, if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing becomes important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their &lt;a href=&quot;http://database.cs.brown.edu/sigmod09/&quot; id=&quot;link-id11be1088&quot;&gt;critique of map reduce&lt;/a&gt;. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.&lt;/p&gt;

&lt;p&gt;Some of our future plans were confirmed by what we saw, for example as concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactively selecting sources for search, showing the graphs, then interactively refining&lt;/li&gt;
&lt;li&gt;More social networks, more network analysis, and more work on social recommendation&lt;/li&gt;
&lt;li&gt;Real time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting micro-formats from results&lt;/li&gt;
&lt;li&gt;Using &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id16440770&quot;&gt;entity&lt;/a&gt; extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others; details will be worked out while we work on the items we implement ourselves.&lt;/p&gt;</description></item><item><title>Linked Data at WWW 2009 (#1 of 5)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2009-04-27#1544</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1544#comments</comments><pubDate>Mon, 27 Apr 2009 21:28:11 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-04-28T11:27:50-04:00</n0:modified><description>&lt;p&gt;(First of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0x114c2450&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.)&lt;/p&gt;

&lt;p&gt;We gave a talk at the &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x166e10f0&quot;&gt;Linked Open Data&lt;/a&gt; workshop, &lt;a href=&quot;http://events.linkeddata.org/ldow2009/&quot; id=&quot;link-id0x19c2b1f0&quot;&gt;LDOW 2009&lt;/a&gt;, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.&lt;/p&gt;

&lt;p&gt;The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x12c65250&quot;&gt;data&lt;/a&gt; is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x441128e0&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.&lt;/p&gt;

&lt;p&gt;People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like &lt;i&gt;x is a bit like y&lt;/i&gt; is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.&lt;/p&gt;

&lt;p&gt;Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.&lt;/p&gt;

&lt;p&gt;We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What used to be done with special code in special parallel programming models can now be done in &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x16766eb0&quot;&gt;SQL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1645ddc8&quot;&gt;SPARQL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To showcase this, we do &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0xa167e698&quot;&gt;linked data&lt;/a&gt; search, browsing, and so on, but are essentially a platform provider.&lt;/p&gt;

&lt;p&gt;Entry costs for relatively high-end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today&amp;#39;s retail prices and fits under a desk. For intermittent use, the rent for 1 TB of RAM is $1228 per day on &lt;a href=&quot;http://aws.amazon.com/ec2/&quot; id=&quot;link-id0xa1a67b70&quot;&gt;EC2&lt;/a&gt;. With this on one side and &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1622d4e0&quot;&gt;Virtuoso&lt;/a&gt; on the other, a lot that was impractical in the past is now within reach. As &lt;a href=&quot;http://g1o.net/foaf.rdf#me&quot; id=&quot;link-id0x3d5c8b50&quot;&gt;Giovanni Tummarello&lt;/a&gt; put it for airplanes, the physics are as they were for &lt;a href=&quot;http://dbpedia.org/resource/Leonardo_da_Vinci&quot; id=&quot;link-id0x198e7cc0&quot;&gt;da Vinci&lt;/a&gt; but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.&lt;/p&gt;

&lt;p&gt;A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.&lt;/p&gt;

&lt;p&gt;I answered as follows, which apparently cannot be repeated too much:&lt;/p&gt;

&lt;p&gt;First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. &lt;i&gt;Ceterum censeo&lt;/i&gt;, for databases, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation-dependent.&lt;/p&gt;

&lt;p&gt;Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random accesses per second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.&lt;/p&gt;

&lt;p&gt;There was a talk about &lt;a href=&quot;http://semanticweb.org/wiki/MashQL&quot; id=&quot;link-id0x60bd57b0&quot;&gt;MashQL&lt;/a&gt; by &lt;a href=&quot;http://data.semanticweb.org/person/mustafa-jarrar&quot; id=&quot;link-id0xa1fb98d8&quot;&gt;Mustafa Jarrar&lt;/a&gt;, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated &amp;quot;on the fly&amp;quot; as they are being composed. As it happens, we already have an &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x1923a380&quot;&gt;XML&lt;/a&gt;-based query abstraction layer incorporated into Virtuoso 6.0&amp;#39;s built-in &lt;a href=&quot;http://lod.openlinksw.com/fct/facet.vsp&quot; id=&quot;link-id0x67712740&quot;&gt;Faceted Data Browser Service&lt;/a&gt;, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: &lt;i&gt;Use the &lt;a href=&quot;http://lod.openlinksw.com/fct/facet.vsp&quot; id=&quot;link-id0x70df2798&quot;&gt;faceted browser service&lt;/a&gt; to get vastly higher performance when querying against Linked Data, be it &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x1b3fd608&quot;&gt;DBpedia&lt;/a&gt; or the &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x13ecd708&quot;&gt;entity&lt;/a&gt; &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x17f16970&quot;&gt;LOD&lt;/a&gt; &lt;a href=&quot;http://lod.openlinksw.com/&quot; id=&quot;link-id0x54334250&quot;&gt;Cloud&lt;/a&gt;. 
Virtuoso 6.0 (Open Source Edition) &amp;quot;&lt;a href=&quot;http://sourceforge.net/project/showfiles.php?group_id=161622&amp;amp;package_id=319652&amp;amp;release_id=677866&quot; id=&quot;link-id12159728&quot;&gt;TP1&lt;/a&gt;&amp;quot; is now publicly available as a Technology Preview (beta).&lt;/i&gt;
&lt;/p&gt;

&lt;p&gt;We heard that there is an effort for porting Freebase&amp;#39;s Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.&lt;/p&gt;

&lt;p&gt;As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is &amp;quot;virtualized&amp;quot; into the database cloud or the local secure server, as the use case may require. &lt;/p&gt;

&lt;p&gt;For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might &lt;i&gt;de facto&lt;/i&gt; have that in just a bit, whenever there is a working draft from the SPARQL WG.&lt;/p&gt;

&lt;p&gt;Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x185d6bf8&quot;&gt;semantic web&lt;/a&gt; papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.&lt;/p&gt;

&lt;p&gt;For future work, there is nothing radically new. We continue testing and productization of cluster databases. Just deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties since we must from time to time introduce various unique functionality. Most interaction should however go via third party applications.&lt;/p&gt;</description></item><item><title>Beyond Applications - Introducing the Planetary Datasphere (Part 1)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2009-03-24#1535</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1535#comments</comments><pubDate>Tue, 24 Mar 2009 14:38:57 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-03-24T10:50:13-04:00</n0:modified><description>&lt;p&gt;This is the first in a short series of &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x12c91d60&quot;&gt;blog&lt;/a&gt; posts about what becomes possible when essentially unlimited &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x2375f488&quot;&gt;linked data&lt;/a&gt; can be deployed on the open web and private intranets.&lt;/p&gt;

&lt;p&gt;The term &lt;i&gt;DataSphere&lt;/i&gt; comes from Dan Simmons&amp;#39; &lt;i&gt;&lt;a href=&quot;http://dbpedia.org/resource/Hyperion_Cantos&quot; id=&quot;link-id12ad4718&quot;&gt;Hyperion&lt;/a&gt;&lt;/i&gt; science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the &lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x13084f08&quot;&gt;net&lt;/a&gt; today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.&lt;/p&gt;

&lt;p&gt;This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.&lt;/p&gt;

&lt;p&gt;I use the word &lt;i&gt;DataSphere&lt;/i&gt; to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x2504fff8&quot;&gt;Data&lt;/a&gt; Spaces, or Named Data Spaces. A &lt;i&gt;Data &lt;a href=&quot;http://en.wikipedia.org/wiki/Data_Spaces&quot; id=&quot;link-id0x81175fa0&quot;&gt;Space&lt;/a&gt;&lt;/i&gt; is essentially a person&amp;#39;s or organization&amp;#39;s contribution to the DataSphere. I use &lt;i&gt;Linked Data &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x70f4e190&quot;&gt;Web&lt;/a&gt;&lt;/i&gt; to refer to component technologies and practices such as &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x3a5ddcd8&quot;&gt;RDF&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x23b049e0&quot;&gt;SPARQL&lt;/a&gt;, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack &lt;i&gt;per se&lt;/i&gt;, but this stack is still the best bet for it.&lt;/p&gt;

&lt;h2&gt;General&lt;/h2&gt;

&lt;p&gt;There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.&lt;/p&gt;

&lt;p&gt;These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.&lt;/p&gt;

&lt;p&gt;The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.&lt;/p&gt;

&lt;p&gt;In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.&lt;/p&gt;

&lt;p&gt;The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network: horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.&lt;/p&gt;

&lt;h2&gt;The place of the DBMS in the stack changes&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x10082590&quot;&gt;RDBMS&lt;/a&gt; in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.&lt;/p&gt;

&lt;p&gt;For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can&amp;#39;t just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google&amp;#39;s web page cache can&amp;#39;t be relied on as a permanent repository of web content.&lt;/p&gt;

&lt;p&gt;Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language &lt;a href=&quot;http://dbpedia.org/resource/Meme&quot; id=&quot;link-id0x23c85e68&quot;&gt;meme&lt;/a&gt;, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.&lt;/p&gt;

&lt;h2&gt;End User Benefits&lt;/h2&gt;

&lt;p&gt;For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user&amp;#39;s web presence is their blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user&amp;#39;s Data Space is the collection of all these things, now presented in a queriable form. The user&amp;#39;s Data Space is the user&amp;#39;s statement of presence, referencing the diverse contributions of the user on diverse sites.&lt;/p&gt;

&lt;p&gt;The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user&amp;#39;s interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.&lt;/p&gt;

&lt;p&gt;Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of &amp;quot;meta-applications,&amp;quot; i.e., unified interfaces for managing many related applications/tasks.&lt;/p&gt;

&lt;p&gt;Of course, there is the increase in semantic richness, such as better contextuality derived from &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x23904698&quot;&gt;entity&lt;/a&gt; extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://myopenlink.net/dataspace/person/kidehen#this&quot; id=&quot;link-id0x37342a60&quot;&gt;Kingsley Idehen&lt;/a&gt; has coined the term &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1442&quot; id=&quot;link-id0x3a56e4e8&quot;&gt;SDQ&lt;/a&gt;, for &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1442&quot; id=&quot;link-id0x23649b70&quot;&gt;Serendipitous Discovery Quotient&lt;/a&gt;, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of &amp;quot;mashing up&amp;quot; web service interfaces with procedural code, but of &amp;quot;meshing&amp;quot; data through declarative queries across application spaces.&lt;/p&gt;

&lt;h2&gt;Applications in the DataSphere&lt;/h2&gt;

&lt;p&gt;The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.&lt;/p&gt;

&lt;p&gt;Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.&lt;/p&gt;

&lt;p&gt;For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x2364eae8&quot;&gt;DBpedia&lt;/a&gt; vs. Freebase identifiers)? The end user will generally be unaware of this choice, and of the varying degrees of normalization in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. The benefit of ready joining without translation will drive adoption: a vocabulary that already has instance data will attract more instance data.&lt;/p&gt;

&lt;p&gt;The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.&lt;/p&gt;
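
&lt;p&gt;As a minimal sketch of one such question, the SPARQL query below, written as it would be typed at Virtuoso&amp;#39;s &lt;code&gt;isql&lt;/code&gt; prompt, lists the graphs that mention a given identifier in subject or object position. The DBpedia URI is only an illustrative choice, and this is plain SPARQL, not the ranking or synonym resolution services themselves.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   SELECT DISTINCT ?g 
    WHERE { { GRAPH ?g { &amp;lt;http://dbpedia.org/resource/DBpedia&amp;gt;  ?p  ?o } } 
            UNION 
            { GRAPH ?g { ?s  ?p  &amp;lt;http://dbpedia.org/resource/DBpedia&amp;gt; } } 
          };
&lt;/pre&gt;&lt;/blockquote&gt;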

&lt;h2&gt;Advertising&lt;/h2&gt;

&lt;p&gt;The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.&lt;/p&gt;

&lt;h2&gt;Content Providers, License, and Attribution&lt;/h2&gt;

&lt;p&gt;For the content provider, the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0xa9abc2f8&quot;&gt;URI&lt;/a&gt; is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.&lt;/p&gt;

&lt;p&gt;Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.

&lt;/p&gt;
&lt;p&gt;One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.&lt;/p&gt;

&lt;p&gt;Next we will look at the service provider and infrastructure side of this.&lt;/p&gt;

&lt;h2&gt;Related&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1442&quot; id=&quot;link-id148ea4e0&quot;&gt;Serendipitous Discovery Quotient (SDQ)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1534&quot; id=&quot;link-id14b07f88&quot;&gt;How Linked Data will change Advertising&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1519&quot; id=&quot;link-id117c6608&quot;&gt;The Time for RDBMS Primacy Downgrade is Nigh!&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace&quot; id=&quot;link-id154e1d58&quot;&gt;Data Spaces&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Linked Data &amp; The Year 2009 (updated)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2009-01-02#1510</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1510#comments</comments><pubDate>Fri, 02 Jan 2009 16:17:06 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-01-02T13:26:35-05:00</n0:modified><description>&lt;p&gt;As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://www.w3.org/People/Berners-Lee/card#i&quot; id=&quot;link-id1119f250&quot;&gt;Sir Tim&lt;/a&gt; said it at WWW08 in &lt;a href=&quot;http://www2008.org/&quot; id=&quot;link-id0x14ab66b0&quot;&gt;Beijing&lt;/a&gt;: &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x115a4588&quot;&gt;linked data&lt;/a&gt; and the linked data &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0xa5c678&quot;&gt;web&lt;/a&gt; is the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x7cbe5540&quot;&gt;semantic web&lt;/a&gt; and the Web done right.&lt;/p&gt;

&lt;p&gt;The grail of &lt;i&gt;ad hoc&lt;/i&gt; analytics on infinite &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa4b25428&quot;&gt;data&lt;/a&gt; has lost none of its appeal.  We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.&lt;/p&gt;

&lt;p&gt;The benefits of a data model more abstract than the relational are increasingly appreciated outside data web circles as well. Microsoft&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x1c3c72b0&quot;&gt;Entity&lt;/a&gt; Framework technology is an example.  Agility has been a buzzword for a long time.  Everything should be offered in a service based business model and should interoperate and integrate with everything else: business needs first; schema last.&lt;/p&gt;

&lt;p&gt;Nor should we forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized.  &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0xa0743bd8&quot;&gt;Information&lt;/a&gt;, as the asset it is, becomes no less important; on the contrary.  But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.&lt;/p&gt;

&lt;p&gt;It is against this backdrop that this year will play out.&lt;/p&gt;

&lt;p&gt;As concerns research, I will &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1374&quot; id=&quot;link-id1151b128&quot;&gt;again quote&lt;/a&gt; &lt;a href=&quot;http://www.ibiblio.org/hhalpin/#&quot; id=&quot;link-id141cb740&quot;&gt;Harry Halpin&lt;/a&gt; at &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x28f68040&quot;&gt;ESWC 2008&lt;/a&gt;: &amp;quot;Men will fight in a war, and even lose a war, for what they believe just.  And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&amp;quot; [or words to this effect].&lt;/p&gt;

&lt;p&gt;Something like the data web, and even the semantic web, will happen. Harry&amp;#39;s question was whether this would be the descendant of what is today called semantic web research.&lt;/p&gt;

&lt;p&gt;I heard in conversation about a project for making a very large metadata store.  I also heard that the makers did not particularly insist on this being &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x13c8af68&quot;&gt;RDF&lt;/a&gt;-based, though.&lt;/p&gt;

&lt;p&gt;Why should such a thing be RDF-based?  If it is already accepted that there will be &lt;i&gt;ad hoc&lt;/i&gt; schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?&lt;/p&gt;

&lt;p&gt;The justification of RDF lies in reusing and linking to data and terminology that is already out there.  Another justification is that an RDF store spares one the work and the many compromises that attend building an &lt;a href=&quot;http://dbpedia.org/resource/Entity-attribute-value_model&quot; id=&quot;link-id0x1ca17b20&quot;&gt;entity&lt;/a&gt;-attribute-value (&lt;a href=&quot;http://dbpedia.org/resource/Entity-attribute-value_model&quot; id=&quot;link-id0x1c9d6050&quot;&gt;EAV&lt;/a&gt;, i.e., triple) store on a generic &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x557dff0&quot;&gt;RDBMS&lt;/a&gt;.  The sem-web world has been there, trust me.  We came out well because we put everything inside the RDBMS, at the lowest level, which you can&amp;#39;t do unless you own the RDBMS.  Source access is not enough; you also need the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1470c748&quot;&gt;knowledge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Technicalities aside, the question is one of proprietary vs. standards-based.  This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. &lt;a href=&quot;http://www.zemanta.com/&quot; id=&quot;link-id0x524bea0&quot;&gt;Zemanta&lt;/a&gt; and &lt;a href=&quot;http://www.opencalais.com/&quot; id=&quot;link-id0x46132d38&quot;&gt;OpenCalais&lt;/a&gt; serving &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x13624fb8&quot;&gt;DBpedia&lt;/a&gt; URIs are examples.  Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.&lt;/p&gt;

&lt;p&gt;Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata).  As on the web, so on the enterprise &lt;a href=&quot;http://dbpedia.org/resource/Intranet&quot; id=&quot;link-id0xa1392eb8&quot;&gt;intranet&lt;/a&gt;.  In this lies the strength of RDF as opposed to proprietary flexible database schemes.  This is a qualitative distinction.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
 &lt;a href=&quot;http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData&quot; id=&quot;link-id117178a8&quot;&gt;&lt;img src=&quot;http://www.openlinksw.com/images/logos/LoDLogo.gif&quot; alt=&quot;Linking Open Data project logo&quot; /&gt;
 &lt;/a&gt;
&lt;br /&gt;
 &lt;a href=&quot;http://dbpedia.org/resource/In_hoc_signo_vinces&quot; id=&quot;link-id115f47e8&quot;&gt;&lt;i&gt;In hoc signo vinces.&lt;/i&gt;
 &lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;In this light, we welcome the &lt;a href=&quot;http://semanticweb.org/wiki/VoiD&quot; id=&quot;link-id0x12352cc0&quot;&gt;voiD&lt;/a&gt; (&lt;a href=&quot;http://semanticweb.org/wiki/VoiD&quot; id=&quot;link-id0x722c18&quot;&gt;VOcabulary of Interlinked Data&lt;/a&gt;), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.&lt;/p&gt;

&lt;p&gt;For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace.  &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x2c21c00&quot;&gt;XML&lt;/a&gt; is for the transaction; RDF is for the discovery, query, and analytics.  As with databases in general, first there was the transaction; then there was the query.  Same here.  For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota.  For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1374&quot; id=&quot;link-id110b8668&quot;&gt;Virtuoso Anytime&lt;/a&gt; feature.  With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage.  Of course, we do not forget advertising.  When data has explicit semantics, contextuality is better than with keywords.&lt;/p&gt;

&lt;p&gt;For these visions to materialize on top of the linked data platform, linked data must join the world of data.  This means messaging that is geared towards the database public.  They know the problem, but the RDF proposition is still not well enough understood for it to connect.&lt;/p&gt;

&lt;p&gt;For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping.  We are also bringing out new Microsoft Entity &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET_Entity_Framework&quot; id=&quot;link-id0x723080&quot;&gt;Framework&lt;/a&gt; components.  This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.&lt;/p&gt;

&lt;p&gt;For &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id0x11e1dfc0&quot;&gt;OpenLink Software&lt;/a&gt;, 2008 was about developing technology for scale, RDF as well as generic relational.  We did show a tiny preview with the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x722d08&quot;&gt;Billion Triples Challenge&lt;/a&gt; demo.  Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale.  We &lt;a href=&quot;http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?id=1489&quot; id=&quot;link-id150c6090&quot;&gt;started offering ready-to-go Virtuoso-hosted linked open data sets&lt;/a&gt; on Amazon EC2 in December.  Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available.  Technical specifics are amply discussed on this &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x10fc1930&quot;&gt;blog&lt;/a&gt;.  There are still some new technology things to be developed this year; first among these are strong &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x7fd25590&quot;&gt;SPARQL&lt;/a&gt; federation, and on-the-fly resizing of server clusters.  On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&amp;#39;s &lt;a href=&quot;https://lion.deri.ie/&quot; id=&quot;link-id115c02f8&quot;&gt;Líon project&lt;/a&gt;.  These will provide platforms for further demonstrating the &amp;quot;web&amp;quot; in data web, as in web-scale smart databasing.&lt;/p&gt;

&lt;p&gt;2009 will see change through scale.  The things that exist will start interconnecting and there will be emergent value.  Deployments will be larger and scale will be readily available through a services model or by installation at one&amp;#39;s own facilities.  We may see the start of Search becoming Find, like &lt;a href=&quot;http://myopenlink.net/dataspace/person/kidehen#this&quot; id=&quot;link-id14e43050&quot;&gt;Kingsley&lt;/a&gt; says, meaning semantics of data guiding search.  Entity extraction will multiply data volumes and bring parts of the data web to real time.&lt;/p&gt;

&lt;p&gt;Exciting 2009 to all.&lt;/p&gt;</description></item><item><title>Virtuoso RDF:  A Getting Started Guide for the Developer</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1504#comments</comments><pubDate>Wed, 17 Dec 2008 12:31:34 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-17T12:41:21.000001-05:00</n0:modified><description>
&lt;p&gt;It is a long standing promise of mine to dispel the false impression that using &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot; id=&quot;link-id113506d0&quot;&gt;Virtuoso&lt;/a&gt; to work with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id115d9528&quot;&gt;RDF&lt;/a&gt; is complicated.&lt;/p&gt;

&lt;p&gt;The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.&lt;/p&gt;

&lt;p&gt;You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under Emacs (&lt;code&gt;M-x shell&lt;/code&gt;) works best, since it makes it easy to cut and paste between the shell and files, although the open source &lt;code&gt;isql&lt;/code&gt; utility should also offer command line editing.&lt;/p&gt;

&lt;p&gt;To get started, cd into &lt;code&gt;binsrc/tests/lubm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To verify that this works, you can run:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;./test_server.sh virtuoso-t&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This will test the server with the LUBM queries.  This should report 45 tests passed.  After this we will do the tests step-by-step.&lt;/p&gt;

&lt;h2&gt;Loading the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id10f7bd90&quot;&gt;Data&lt;/a&gt;
&lt;/h2&gt; 

&lt;p&gt;The file &lt;code&gt;lubm-load.sql&lt;/code&gt; contains the commands for loading the LUBM single university qualification database.&lt;/p&gt;

&lt;p&gt;The data files themselves are in &lt;code&gt;lubm_8000&lt;/code&gt;: 15 files in RDF/XML.&lt;/p&gt;

&lt;p&gt;There is also a little ontology called &lt;code&gt;inf.nt&lt;/code&gt;.  This declares the subclass and subproperty relations used in the benchmark.&lt;/p&gt;

&lt;p&gt;So now let&amp;#39;s go through this procedure.&lt;/p&gt;

&lt;p&gt;Start the server:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;$ virtuoso-t -f &amp;amp;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This starts the server in foreground mode, and puts it in the background of the shell.&lt;/p&gt;

&lt;p&gt;Now we connect to it with the isql utility.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;$ isql 1111 dba dba 
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This gives a &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.  The default username and password are both &lt;code&gt;dba&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When a command is &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1176ce70&quot;&gt;SQL&lt;/a&gt;, it is entered directly.  If it is &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id156df468&quot;&gt;SPARQL&lt;/a&gt;, it is prefixed with the keyword &lt;code&gt;sparql&lt;/code&gt;.  This is how all the SQL clients work.  Any SQL client, such as any &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id152d0a00&quot;&gt;ODBC&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id157ad6a0&quot;&gt;JDBC&lt;/a&gt; application, can use SPARQL if the SQL string starts with this keyword.&lt;/p&gt;
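
&lt;p&gt;As a minimal illustration of this convention, the two statements below could be typed at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.  The first is plain SQL against &lt;code&gt;DB.DBA.RDF_QUAD&lt;/code&gt;, the table in which Virtuoso keeps its triples; the second is SPARQL, recognized as such only because of the leading &lt;code&gt;sparql&lt;/code&gt; keyword.  Both simply count what is currently stored, so the numbers depend on what has been loaded.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;SELECT COUNT(*) 
  FROM DB.DBA.RDF_QUAD ;

sparql 
   SELECT COUNT(*) 
    WHERE { ?s ?p ?o } ;
&lt;/pre&gt;&lt;/blockquote&gt;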

&lt;p&gt;The &lt;code&gt;lubm-load.sql&lt;/code&gt; file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, &lt;code&gt;DB..RDF_LOAD_RDFXML&lt;/code&gt;, for each file in a directory.&lt;/p&gt;
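
&lt;p&gt;For orientation, a directory loader of this kind might look roughly like the sketch below.  This is not the text of &lt;code&gt;lubm-load.sql&lt;/code&gt;: the procedure name &lt;code&gt;load_rdfxml_dir&lt;/code&gt; and the &lt;code&gt;.owl&lt;/code&gt; file name pattern are assumptions made for illustration, and the server must be configured to allow file access to the directory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- Hypothetical sketch; not the actual contents of lubm-load.sql.
-- dir must end with a slash, since file names are appended to it directly.
create procedure load_rdfxml_dir (in dir varchar, in graph varchar)
{
  declare files any;
  declare f varchar;
  declare i integer;
  -- Second argument 1 asks sys_dirlist for files rather than subdirectories
  files := sys_dirlist (dir, 1);
  for (i := 0; i &amp;lt; length (files); i := i + 1)
    {
      f := aref (files, i);
      -- Assumption: the data files end in .owl; adjust the pattern as needed
      if (f like &amp;#39;%.owl&amp;#39;)
        DB.DBA.RDF_LOAD_RDFXML (file_to_string (dir || f), graph, graph);
    }
}
;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The three arguments of &lt;code&gt;DB.DBA.RDF_LOAD_RDFXML&lt;/code&gt; are the text to load, the base URI, and the target graph; the sketch passes the graph name for both of the latter.&lt;/p&gt;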

&lt;p&gt;Next it calls this function for the &lt;code&gt;lubm_8000&lt;/code&gt; directory under the server&amp;#39;s working directory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   CLEAR GRAPH &amp;lt;lubm&amp;gt;;

sparql 
   CLEAR GRAPH &amp;lt;inf&amp;gt;;

load_lubm ( server_root() || &amp;#39;/lubm_8000/&amp;#39; );
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then it verifies that the right number of triples is found in the &amp;lt;lubm&amp;gt; graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   SELECT COUNT(*) 
     FROM &amp;lt;lubm&amp;gt; 
    WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.&lt;/p&gt;

&lt;p&gt;Then it adds some implied &lt;code&gt;subOrganizationOf&lt;/code&gt; triples.  This is part of setting up the LUBM test database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   PREFIX  ub:  &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   INSERT 
      INTO GRAPH &amp;lt;lubm&amp;gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &amp;lt;lubm&amp;gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then it loads the ontology file, &lt;code&gt;inf.nt&lt;/code&gt;, using the Turtle load function, &lt;code&gt;DB.DBA.TTLP&lt;/code&gt;; N-Triples is a subset of Turtle, so the same loader handles it.  The arguments of the function are the text to load, the base IRI against which relative IRIs in the text are resolved, and the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id15835550&quot;&gt;URI&lt;/a&gt; of the target graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;DB.DBA.TTLP ( file_to_string ( &amp;#39;inf.nt&amp;#39; ), 
              &amp;#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&amp;#39;, 
              &amp;#39;inf&amp;#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &amp;lt;inf&amp;gt; 
    WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then we declare that the triples in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph can be used for inference at run time.  To enable this, a SPARQL query will declare that it uses the &lt;code&gt;&amp;#39;inft&amp;#39;&lt;/code&gt; rule set.  Otherwise this has no effect.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;rdfs_rule_set (&amp;#39;inft&amp;#39;, &amp;#39;inf&amp;#39;);
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The last statement is just a checkpoint, which finalizes the work and truncates the transaction log.  The server would eventually do this on its own schedule anyway.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;checkpoint;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Now we are ready for querying.&lt;/p&gt;

&lt;h2&gt;Querying the Data&lt;/h2&gt; 

&lt;p&gt;The queries are given in 3 different versions: The first file, &lt;code&gt;lubm.sql&lt;/code&gt;, has the queries with most inference open coded as &lt;code&gt;UNIONs&lt;/code&gt;. The second file, &lt;code&gt;lubm-inf.sql&lt;/code&gt;, has the inference performed at run time using the ontology &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id1109faf0&quot;&gt;information&lt;/a&gt; in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph we just loaded.  The last, &lt;code&gt;lubm-phys.sql&lt;/code&gt;, relies on having the entailed triples physically present in the &lt;code&gt;&amp;lt;lubm&amp;gt;&lt;/code&gt; graph.  These entailed triples are inserted by the SPARUL commands in the &lt;code&gt;lubm-cp.sql&lt;/code&gt; file.&lt;/p&gt;
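
&lt;p&gt;To illustrate the run-time variant: a query opts into the rule set with a &lt;code&gt;define input:inference&lt;/code&gt; pragma placed before the rest of the query text.  The sketch below is not taken from &lt;code&gt;lubm-inf.sql&lt;/code&gt;; it assumes that &lt;code&gt;inf.nt&lt;/code&gt; declares the subclasses of &lt;code&gt;ub:Professor&lt;/code&gt;, as the LUBM ontology does, so that instances of those subclasses are counted as well.  Without the pragma, only resources explicitly typed as &lt;code&gt;ub:Professor&lt;/code&gt; would match.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   DEFINE input:inference &amp;#39;inft&amp;#39;
   PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   SELECT COUNT(*) 
     FROM &amp;lt;lubm&amp;gt; 
    WHERE { ?x  a  ub:Professor } ;
&lt;/pre&gt;&lt;/blockquote&gt;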

&lt;p&gt;If you wish to run all the commands in a SQL file, you can type &lt;code&gt;load &amp;lt;filename&amp;gt;;&lt;/code&gt; (e.g., &lt;code&gt;load lubm-cp.sql;&lt;/code&gt;) at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt. If you wish to try individual statements, you can paste them to the command line.&lt;/p&gt;

&lt;p&gt;For example: &lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;SQL&amp;gt; sparql 
   PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   SELECT * 
     FROM &amp;lt;lubm&amp;gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &amp;lt;http://www.Department0.University0.edu/AssistantProfessor0&amp;gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
&lt;/pre&gt;&lt;/blockquote&gt;


&lt;p&gt;To stop the server, simply type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.&lt;/p&gt;

&lt;p&gt;If you wish to use a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id11384668&quot;&gt;SPARQL protocol&lt;/a&gt; end point, just enable the HTTP listener.  This is done by adding a stanza like the following to the end of the &lt;code&gt;virtuoso.ini&lt;/code&gt; file in the &lt;code&gt;lubm&lt;/code&gt; directory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then shut down and restart the server (type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt and then &lt;code&gt;virtuoso-t -f &amp;amp;&lt;/code&gt; at the shell prompt).&lt;/p&gt;

&lt;p&gt;Now you can connect to the end point with a web browser.  The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id113d02d8&quot;&gt;URL&lt;/a&gt; is &lt;code&gt;http://localhost:8421/sparql&lt;/code&gt;. Without parameters, this shows a human-readable query form.  With a &lt;code&gt;query&lt;/code&gt; parameter, as defined by the SPARQL protocol, it executes the SPARQL text given in that parameter.&lt;/p&gt;

&lt;p&gt;We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id142d0ba0&quot;&gt;PHP&lt;/a&gt;, using the PHP ODBC interface.&lt;/p&gt;

&lt;p&gt;To see how to use &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id117074f0&quot;&gt;Jena&lt;/a&gt; or &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id1103c9b0&quot;&gt;Sesame&lt;/a&gt; with Virtuoso, look at &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html&quot; id=&quot;link-id15488ce8&quot;&gt;Native RDF Storage Providers&lt;/a&gt;. To see how RDF data types are supported, see &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf&quot; id=&quot;link-id15784a40&quot;&gt;Extension datatype for RDF&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do &lt;code&gt;log_enable (2);&lt;/code&gt; before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/&quot; id=&quot;link-id111410f0&quot;&gt;documentation&lt;/a&gt; for more.&lt;/p&gt;</description></item><item><title>See the Lite:  Embeddable/Background Virtuoso starts at 25MB</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1502</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1502#comments</comments><pubDate>Wed, 17 Dec 2008 09:34:12 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-17T12:03:43-05:00</n0:modified><description>&lt;p&gt;We have received many requests for an embeddable-scale &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xa5aa1b38&quot;&gt;Virtuoso&lt;/a&gt;.  In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings.  With 2MB of disk cache buffers (ini file setting, &lt;code&gt;NumberOfBuffers = 256&lt;/code&gt;), the process size stays under 30MB on 32-bit Linux.&lt;/p&gt;

&lt;p&gt;The value of this is that one can now have &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1db79ac8&quot;&gt;RDF&lt;/a&gt; and full text indexing on the desktop without running a Java VM or any other memory-intensive software.  And of course, all of &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xa923298&quot;&gt;SQL&lt;/a&gt; (transactions, stored procedures, etc.) is in the same embeddably-sized container.&lt;/p&gt;

&lt;p&gt;The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file.  The executable size is about 10MB for 32-bit Linux.  A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting &amp;quot;off&amp;quot;; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc.&lt;/p&gt;

&lt;p&gt;Lite mode offers full SQL and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1b388830&quot;&gt;SPARQL&lt;/a&gt;/SPARUL (via SPASQL), but disables all &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x1d56b618&quot;&gt;HTTP&lt;/a&gt;-based services (WebDAV, application hosting, etc.).  Clients can still use all typical database access mechanisms (i.e., &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x1c5abc38&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1dade1f8&quot;&gt;JDBC&lt;/a&gt;, OLE-DB, &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x25d8e0f0&quot;&gt;ADO&lt;/a&gt;.&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x1d7a1a28&quot;&gt;NET&lt;/a&gt;, and XMLA) to connect, including the &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x1d929b98&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x1b7a9088&quot;&gt;Sesame&lt;/a&gt; frameworks for RDF.  ODBC now offers full support of RDF &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xaf62aa0&quot;&gt;data&lt;/a&gt; types for &lt;a href=&quot;http://dbpedia.org/resource/C%2B%2B&quot; id=&quot;link-id0xa8784b0&quot;&gt;C&lt;/a&gt;-based clients.  A Redland-compatible API also exists, for use with Redland v1.0.8 and later. &lt;/p&gt;

&lt;p&gt;Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost.&lt;/p&gt;

&lt;p&gt;Shipping an embedded Virtuoso is easy.  It just takes one executable and one configuration file.  Performance is generally comparable to &amp;quot;normal&amp;quot; mode, except that Lite will be somewhat less scalable on multicore systems.&lt;/p&gt;

&lt;p&gt;The Lite mode will be included in the next Virtuoso 5 Open Source release.&lt;/p&gt;</description></item><item><title>&quot;E Pluribus Unum&quot;, or &quot;Inversely Functional Identity&quot;, or &quot;Smooshing Without the Stickiness&quot; (re-updated)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1498#comments</comments><pubDate>Tue, 16 Dec 2008 14:14:43 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-16T15:01:30-05:00</n0:modified><description>&lt;p&gt;What a terrible word, smooshing...  I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other.  This smooshes them together, makes them interchangeable.&lt;/p&gt;

&lt;p&gt;This is complex, so I will begin with the point; the interested may read on for the details and implications.  Starting with the soon-to-be-released version 6, &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id15718cb8&quot;&gt;Virtuoso&lt;/a&gt; allows you to say that two things, if they share a uniquely identifying property, are the same.  Examples of uniquely identifying properties would be a book&amp;#39;s ISBN, or a person&amp;#39;s social security number plus full name.  In relational language this is a &lt;i&gt;unique key&lt;/i&gt;, and in &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id145ed998&quot;&gt;RDF&lt;/a&gt; parlance, an &lt;i&gt;inverse functional property&lt;/i&gt;.&lt;/p&gt;

&lt;p&gt;In most systems, such problems are dealt with as a preprocessing step before querying.  For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules.  This is good if the rules are clear and understood.  This is so in closed situations, where things tend to have standard identifiers to begin with.  But on the open web this is not so clear cut.&lt;/p&gt;

&lt;p&gt;In this post, we show how to do these things &lt;i&gt;ad hoc&lt;/i&gt;, without materializing anything.  At the end, we also show how to materialize identity and what the consequences of this are with open web &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id11726358&quot;&gt;data&lt;/a&gt;.  We use real live web crawls from the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id14f40448&quot;&gt;Billion Triples Challenge&lt;/a&gt; data set.&lt;/p&gt;

&lt;p&gt;On the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id156e2b10&quot;&gt;linked data&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id1106ce08&quot;&gt;web&lt;/a&gt;, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated.  But this is only the beginning of the problems.&lt;/p&gt;

&lt;p&gt;To address these, we have added the option of specifying that some property will be considered inversely functional in a query.  This is done at run time and the property does not really have to be inversely functional in the pure sense.  &lt;code&gt;foaf:name&lt;/code&gt; will do for an example.  This simply means that for purposes of the query concerned, two subjects which have at least one &lt;code&gt;foaf:name&lt;/code&gt; in common are considered the same. In this way, we can join between FOAF files.  With the same database, a query about music preferences might consider having the same name as &amp;quot;same enough,&amp;quot; but a query about criminal prosecution would obviously need to be more precise about sameness.&lt;/p&gt;

&lt;p&gt;Our ontology is defined like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- Populate a named graph with the triples you want to use in query time inferencing&lt;br /&gt;
ttlp ( &amp;#39;
        @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
        @prefix owl:  &lt;http://www.w3.org/2002/07/owl#&gt; .
        foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty  .
        foaf:name          a  owl:InverseFunctionalProperty  .
       &amp;#39;,
       &amp;#39;xx&amp;#39;,
       &amp;#39;b3sifp&amp;#39;
     );&lt;br /&gt;
-- Declare that the graph contains an ontology for use in query time inferencing &lt;br /&gt;
rdfs_rule_set ( &amp;#39;http://example.com/rules/b3sifp#&amp;#39;,
                &amp;#39;b3sifp&amp;#39;
              );
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then use it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   DEFINE input:inference &amp;quot;http://example.com/rules/b3sifp#&amp;quot; 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &amp;quot;&amp;#39;Kjetil Kjernsmo&amp;#39;&amp;quot;  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };&lt;br /&gt;
VARCHAR                                  VARCHAR                                           VARCHAR
______________________________________   _______________________________________________   ______________________________&lt;br /&gt;
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Without the inference, we get no matches.  This is because the data in question has one graph per FOAF file, and blank nodes for persons.  No graph references any person outside the ones in the graph.  So if somebody is mentioned as known, then without the inference there is no way to get to what that person&amp;#39;s FOAF file says, since the same individual will be a different blank node there.  The declaration in the context named &lt;code&gt;b3sifp&lt;/code&gt; just means that all things with a matching &lt;code&gt;foaf:name&lt;/code&gt; or &lt;code&gt;foaf:mbox_sha1sum&lt;/code&gt; are the same.&lt;/p&gt;

&lt;p&gt;Sameness means that two are the same for purposes of &lt;code&gt;DISTINCT&lt;/code&gt; or &lt;code&gt;GROUP BY&lt;/code&gt;, and if two are the same, then both have the &lt;code&gt;UNION&lt;/code&gt; of all of the properties of both.&lt;/p&gt;

&lt;p&gt;If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for &lt;code&gt;DISTINCT&lt;/code&gt;.&lt;/p&gt;
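
&lt;p&gt;The sameness semantics can be illustrated with a small Python sketch on toy data.  This is not how the server implements the inference; the function names &lt;code&gt;same_classes&lt;/code&gt; and &lt;code&gt;merged_view&lt;/code&gt; and the blank node labels are made up for the illustration.  Subjects sharing a value of a property declared identifying fall into one class, each class counts once, and each class sees the union of the properties of its members.&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
```python
from collections import defaultdict


def same_classes(triples, ifps):
    """Map every subject to the representative of its sameness class.

    Two subjects are the same if they share at least one value of any
    property listed in ifps (hypothetical helper, toy semantics only).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # the smallest label becomes the representative
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}
    for s, p, o in triples:
        find(s)
        if p in ifps:
            key = (p, o)
            if key in first_seen:
                union(first_seen[key], s)
            else:
                first_seen[key] = s
    return {s: find(s) for s in list(parent)}


def merged_view(triples, ifps):
    """One entry per class, holding the union of all members' properties."""
    rep = same_classes(triples, ifps)
    props = defaultdict(set)
    for s, p, o in triples:
        props[rep[s]].add((p, rep.get(o, o)))
    return dict(props)


# Toy data: one blank node per mention, as in the one-graph-per-FOAF-file case.
triples = [
    ("_:a1", "foaf:name", "Alice"),
    ("_:a1", "foaf:knows", "_:b1"),
    ("_:b1", "foaf:name", "Bob"),    # Bob as mentioned in the first file
    ("_:b2", "foaf:name", "Bob"),    # Bob in his own file
    ("_:b2", "foaf:knows", "_:c1"),
    ("_:c1", "foaf:name", "Carol"),
]

view = merged_view(triples, {"foaf:name"})
```
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;With &lt;code&gt;foaf:name&lt;/code&gt; treated as identifying, the two Bob nodes collapse into one class, so Alice reaches Carol in two &lt;code&gt;foaf:knows&lt;/code&gt; hops.  With an empty property set the four blank nodes stay distinct and the path is broken, which corresponds to the no-matches case without inference.&lt;/p&gt;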

&lt;p&gt;If we have complex application rules for determining whether individuals are the same, then one can materialize &lt;code&gt;owl:sameAs&lt;/code&gt; triples and keep them in a separate graph.  In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.&lt;/p&gt;
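
&lt;p&gt;As a rough illustration of why the &lt;code&gt;owl:sameAs&lt;/code&gt; volume stays small, here is a Python sketch that links every alias to one canonical member of its group.  The graph name &lt;code&gt;urn:example:sameas&lt;/code&gt;, the helper &lt;code&gt;same_as_quads&lt;/code&gt; and the rule of grouping by name are assumptions made for the example; a real application rule would be more involved.&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
```python
from collections import defaultdict

SAME_AS = "owl:sameAs"


def same_as_quads(key_of, graph="urn:example:sameas"):
    """Return (g, s, p, o) quads linking each alias to its group's canonical member.

    key_of maps an identifier to whatever the application rule treats as
    identifying (here simply a name). The graph name is hypothetical.
    """
    groups = defaultdict(list)
    for ident, key in key_of.items():
        groups[key].append(ident)

    quads = []
    for members in groups.values():
        canonical = min(members)
        for alias in sorted(members):
            if alias != canonical:
                quads.append((graph, alias, SAME_AS, canonical))
    return quads


key_of = {"_:b1": "Bob", "_:b2": "Bob", "_:b3": "Bob", "_:a1": "Alice"}
quads = same_as_quads(key_of)
```
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;A group of n aliases yields n - 1 statements however many properties each alias carries, and all of them live in their own graph, apart from the source data.&lt;/p&gt;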

&lt;p&gt;The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up.  Best and worst cases will always depend on the data.  In rough terms, the more &lt;i&gt;ad hoc&lt;/i&gt; the use, the less desirable the materialization.  If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best.  We can do that too, but so can others.&lt;/p&gt;

&lt;p&gt;The principal point is about agility as concerns the inference.  Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble.  When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal.  Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id1156a550&quot;&gt;blog&lt;/a&gt; posts or something of the sort.  So queries will typically be selective.  Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. &lt;/p&gt;

&lt;p&gt;The exception is corpus-wide statistics queries.  But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound.  And when things are disk bound, size matters.  Reading extra entailment on the way is just a performance penalty.&lt;/p&gt;

&lt;p&gt;Enough talk. Time for an experiment.  We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolve identity at insert time.  We remove duplicate person URIs, and give the single &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id11317008&quot;&gt;URI&lt;/a&gt; all the properties of all the duplicate URIs.  We expect these to be most often repeats.  If a person references another person, we normalize this reference to go to the single URI of the referenced person.&lt;/li&gt;

&lt;li&gt;Give every duplicate URI of a person all the properties of all the duplicates.  If these are the same value, the data should not get much bigger, or so we think.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the experiment, we will consider two people the same if they have the same &lt;code&gt;foaf:name&lt;/code&gt; and are both instances of &lt;code&gt;foaf:Person&lt;/code&gt;.  This gets some extra hits but should not be statistically significant.&lt;/p&gt;
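
&lt;p&gt;Before the real script, the two materializations can be sketched in Python on a handful of triples.  This is a toy model under stated assumptions: identity is by &lt;code&gt;foaf:name&lt;/code&gt; only (the &lt;code&gt;foaf:Person&lt;/code&gt; check is left out), the canonical subject is simply the smallest label, and the function name &lt;code&gt;smoosh_counts&lt;/code&gt; is invented for the example.&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
```python
from collections import defaultdict


def smoosh_counts(triples, name_pred="foaf:name"):
    """Return (collapsed_count, duplicated_count) for a list of (s, p, o)."""
    names = defaultdict(set)             # subject to its names
    for s, p, o in triples:
        if p == name_pred:
            names[s].add(o)

    subjects_by_name = defaultdict(set)  # name to all subjects carrying it
    for s, ns in names.items():
        for n in ns:
            subjects_by_name[n].add(s)

    # canonical subject per name, and alias-to-canonical map
    canonical = {n: min(subs) for n, subs in subjects_by_name.items()}
    pref = {s: canonical[n] for n, subs in subjects_by_name.items() for s in subs}

    # all (name, property, object) combinations of subjects with that name
    name_prop = set()
    for s, p, o in triples:
        for n in names.get(s, ()):
            name_prop.add((n, p, o))

    # Strategy 1: one subject per name, person-to-person references normalized
    collapsed = {(canonical[n], p, pref.get(o, o)) for n, p, o in name_prop}

    # Strategy 2: every alias gets every property of every other alias
    duplicated = {(s, p, o) for n, p, o in name_prop for s in subjects_by_name[n]}

    return len(collapsed), len(duplicated)


triples = [
    ("_:b1", "foaf:name", "Bob"),
    ("_:b1", "foaf:nick", "bob"),
    ("_:b2", "foaf:name", "Bob"),
    ("_:b2", "foaf:nick", "bobby"),
    ("_:b3", "foaf:name", "Bob"),
    ("_:b3", "foaf:knows", "_:a1"),
    ("_:a1", "foaf:name", "Alice"),
]

collapsed_count, duplicated_count = smoosh_counts(triples)
```
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;On this input, seven original triples become five when aliases are collapsed and thirteen when every alias receives every property.  Even here, the duplicated form grows only because the aliases do not all say the same things.&lt;/p&gt;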

&lt;p&gt;The following is a commented &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id110945b0&quot;&gt;SQL&lt;/a&gt; script performing the smoosh.  We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone.  We use SPARQL where possible for readability.  As the documentation states, &lt;code&gt;iri_to_id&lt;/code&gt; converts from the qualified name of an IRI to its ID and &lt;code&gt;id_to_iri&lt;/code&gt; does the reverse.&lt;/p&gt;

&lt;p&gt;We count the triples that enter into the smoosh:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- the name is an existence because else we&amp;#39;d get several times more due to 
-- the names occurring in many graphs &lt;br /&gt;
sparql 
   SELECT COUNT(*) 
    WHERE { { SELECT DISTINCT ?person 
               WHERE { ?person a foaf:Person }
            } . 
            FILTER ( bif:exists ( SELECT (1) 
                                   WHERE { ?person foaf:name ?nn } 
                                )
                       ) . 
            ?person ?p ?o
          };&lt;br /&gt;
-- We get 3284674
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We make a few tables for intermediate results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- For each distinct name, gather the properties and objects from 
-- all subjects with this name &lt;br /&gt;
CREATE TABLE name_prop 
   ( np_name  ANY, 
     np_p     IRI_ID_8, 
     np_o     ANY, 
     PRIMARY KEY ( np_name, 
                   np_p, 
                   np_o
                 )
   );
ALTER INDEX name_prop 
   ON name_prop 
   PARTITION ( np_name VARCHAR (-1, 0hexffff) );&lt;br /&gt;
-- Map from name to canonical IRI used for the name &lt;br /&gt;
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8
                      );
ALTER INDEX name_iri 
   ON name_iri 
   PARTITION ( ni_name VARCHAR (-1, 0hexffff) );&lt;br /&gt;
-- Map from person IRI to canonical person IRI&lt;br /&gt;
CREATE TABLE pref_iri 
   ( i     IRI_ID_8, 
     pref  IRI_ID_8, 
     PRIMARY KEY ( i )
   );
ALTER INDEX pref_iri 
   ON pref_iri 
   PARTITION ( i INT (0hexffff00) );&lt;br /&gt;
-- a table for the materialization where all aliases get all properties of every other &lt;br /&gt;
CREATE TABLE smoosh_ct 
   ( s  IRI_ID_8, 
     p  IRI_ID_8, 
     o  ANY, 
     PRIMARY KEY ( s, 
                   p, 
                   o
                 ) 
   );
ALTER INDEX smoosh_ct 
   ON smoosh_ct 
   PARTITION ( s INT (0hexffff00) );&lt;br /&gt;
-- disable transaction log and enable row auto-commit.  This is necessary, otherwise 
-- bulk operations are done transactionally and they will run out of rollback space.&lt;br /&gt;
LOG_ENABLE (2);&lt;br /&gt;
-- Gather all the properties of all persons with a name under that name.  
-- INSERT SOFT means that duplicates are ignored &lt;br /&gt;
INSERT SOFT name_prop 
   SELECT &amp;quot;n&amp;quot;, &amp;quot;p&amp;quot;, &amp;quot;o&amp;quot; 
   FROM ( sparql 
          DEFINE output:valmode &amp;quot;LONG&amp;quot; 
          SELECT ?n ?p ?o 
          WHERE { ?x a foaf:Person . 
                 ?x foaf:name ?n . 
                 ?x ?p ?o
               }
        ) xx ;&lt;br /&gt;
-- Now choose for each name the canonical IRI &lt;br /&gt;
INSERT INTO name_iri 
   SELECT np_name, 
          ( SELECT MIN (s) 
              FROM rdf_quad 
             WHERE o = np_name 
                   AND p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;)
          ) AS mini 
     FROM name_prop 
    WHERE np_p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;) ;&lt;br /&gt;
-- For each person IRI, map to the canonical IRI of that person &lt;br /&gt;
INSERT SOFT pref_iri (i, pref) 
   SELECT s, 
          ni_s 
     FROM name_iri, 
          rdf_quad 
    WHERE o = ni_name 
          AND p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;) ;&lt;br /&gt;
-- Make a graph where all persons have one iri with all the properties of all aliases 
-- and where person-to-person refs are canonicalized&lt;br /&gt;
INSERT SOFT rdf_quad (g,s,p,o) 
   SELECT IRI_TO_ID (&amp;#39;psmoosh&amp;#39;), 
          ni_s, 
          np_p, 
          COALESCE ( ( SELECT pref 
                         FROM pref_iri 
                        WHERE i = np_o
                     ), 
                     np_o 
                   )
     FROM name_prop, 
          name_iri 
    WHERE ni_name = np_name 
   OPTION ( loop, quietcast ) ;&lt;br /&gt;
-- A little explanation:  The properties of names are copied into rdf_quad with the name 
-- replaced with its canonical IRI.  If the object has a canonical IRI, this is used as 
-- the object, else the object is unmodified.  This is the COALESCE with the sub-query.&lt;br /&gt;
-- This takes a little time.  To check on the progress, take another connection to the 
-- server and do &lt;br /&gt;
STATUS (&amp;#39;cluster&amp;#39;);&lt;br /&gt;
-- It will return something like 
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s  75% cpu 186%  read 12% clw threads 5r 0w 0i 
-- buffers 549481 253929 d 8 w 0 pfs&lt;br /&gt;
-- Now finalize the state; this makes it permanent.  Else the work will be lost on server 
-- failure, since there was no transaction log &lt;br /&gt;
CL_EXEC (&amp;#39;checkpoint&amp;#39;);&lt;br /&gt;
-- See what we got&lt;br /&gt;
sparql 
   SELECT COUNT (*) 
     FROM &amp;lt;psmoosh&amp;gt; 
     WHERE {?s ?p ?o};&lt;br /&gt;
-- This is 2253102&lt;br /&gt;
-- Now make the copy where all have the properties of all synonyms.  This takes so much 
-- space we do not insert it as RDF quads, but make a special table for it so that we can 
-- run some statistics.  This saves time.&lt;br /&gt;
INSERT SOFT smoosh_ct (s, p, o)  
   SELECT s, np_p, np_o 
     FROM name_prop, 
          rdf_quad 
    WHERE o = np_name 
          AND p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;) ;&lt;br /&gt;
-- as above, INSERT SOFT so as to ignore duplicates &lt;br /&gt;
SELECT COUNT (*) 
   FROM smoosh_ct;&lt;br /&gt;
-- This is  167360324&lt;br /&gt;
-- Find out where the bloat comes from &lt;br /&gt;
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (p) 
   FROM smoosh_ct 
   GROUP BY p 
   ORDER BY 1 DESC;
&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;The results are:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Now compare with the predicate distribution of the smoosh with identities canonicalized:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
     SELECT COUNT (*) ?p 
       FROM &amp;lt;psmoosh&amp;gt; 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Results are:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We can drop the &lt;code&gt;owl:sameAs&lt;/code&gt; triples from the count, which reduces the bloat somewhat, but the result is still tens of times larger than the canonicalized copy or the initial state.&lt;/p&gt;

&lt;p&gt;Now, when we try using the psmoosh graph, we still get different results from those with the original data.  This is because &lt;code&gt;foaf:knows&lt;/code&gt; relations to things with no &lt;code&gt;foaf:name&lt;/code&gt; are not represented in the smoosh.  These exist:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };&lt;br /&gt;
-- 1393940
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;So the smoosh graph is not an accurate rendition of the social network.  It would have to be smooshed further to be that, since the data in the sample is quite irregular.  But we do not go that far here.&lt;/p&gt;

&lt;p&gt;Finally, we calculate the smoosh blow up factors.  We do not include &lt;code&gt;owl:sameAs&lt;/code&gt; triples in the counts.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;select (167360324 - 54728777) / 3284674.0;
34.290022997716059&lt;br /&gt;
-- 2229307 = 2253102 - 23795, the psmoosh count less its owl:sameAs triples
select 2229307 / 3284674.0;
0.678699621332284
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 0.68 if synonyms are collapsed, or by 34 if they are not.&lt;/p&gt;
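
&lt;p&gt;For reference, the same arithmetic as a short Python sketch, using only the counts reported above.  The constant and function names are ours, chosen for the illustration.&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
```python
ORIGINAL = 3284674              # triples of named persons before smooshing
DUPLICATED = 167360324          # rows in smoosh_ct (every alias gets everything)
DUPLICATED_SAME_AS = 54728777   # owl:sameAs rows within smoosh_ct
COLLAPSED = 2253102             # triples in the psmoosh graph
COLLAPSED_SAME_AS = 23795       # owl:sameAs triples within psmoosh


def blow_up(total, same_as, original=ORIGINAL):
    """Size relative to the original, not counting owl:sameAs triples."""
    return (total - same_as) / original


duplicated_factor = blow_up(DUPLICATED, DUPLICATED_SAME_AS)
collapsed_factor = blow_up(COLLAPSED, COLLAPSED_SAME_AS)

print(round(duplicated_factor, 2), round(collapsed_factor, 2))  # prints 34.29 0.68
```
&lt;/pre&gt;&lt;/blockquote&gt;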

&lt;p&gt;Making the smooshes does not take very long, some minutes for the small one.  Inserting the big one would be longer, a couple of hours maybe.  It was 33 minutes for filling the &lt;code&gt;smoosh_ct&lt;/code&gt; table.  The metrics were not with optimal tuning so the performance numbers just serve to show that smooshing takes time.  Probably more time than allowable in an interactive situation, no matter how the process is optimized.&lt;/p&gt;</description></item><item><title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1465#comments</comments><pubDate>Sun, 26 Oct 2008 12:15:35 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-27T12:07:52-04:00</n0:modified><description>&lt;p&gt;&amp;quot;Physician, heal thyself,&amp;quot; it is said. We profess to say what the messaging of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1fa3da18&quot;&gt;semantic web&lt;/a&gt; ought to be, but is our own perfect?&lt;/p&gt;

&lt;p&gt;I will here engage in some critical introspection as well as amplify on some answers given to &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1e1eecf0&quot;&gt;Virtuoso&lt;/a&gt;-related questions in recent times.&lt;/p&gt;

&lt;p&gt;I use some conversations from the &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x1ec0b2e0&quot;&gt;Vienna&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x2045ac10&quot;&gt;Linked Data&lt;/a&gt; Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x2045ac38&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x14f63c58&quot;&gt;OpenLink Data Spaces&lt;/a&gt;) applications line, &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x14f63c80&quot;&gt;OAT&lt;/a&gt; (&lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1e536928&quot;&gt;OpenLink Ajax Toolkit&lt;/a&gt;), or &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1eaed7f8&quot;&gt;ODE&lt;/a&gt; (&lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1edfff88&quot;&gt;OpenLink Data Explorer&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;&amp;quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&amp;quot; said &lt;a href=&quot;http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me&quot; id=&quot;link-id0x1b8a9580&quot;&gt;Sören Auer&lt;/a&gt;.&lt;/h3&gt;

&lt;p&gt;Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.&lt;/p&gt;

&lt;p&gt;This is why we put a lot of emphasis on Linked Data and the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1b85fa38&quot;&gt;Data&lt;/a&gt; Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1f5b4468&quot;&gt;RDF&lt;/a&gt; store.&lt;/p&gt;

&lt;p&gt;We can do this because we own our database and &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x20076468&quot;&gt;SQL&lt;/a&gt; and data access middleware and have a history of connecting to any &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1ffd6f98&quot;&gt;RDBMS&lt;/a&gt; out there.&lt;/p&gt;

&lt;p&gt;The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.&lt;/p&gt;

&lt;p&gt;There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1316f720&quot;&gt;SPARQL&lt;/a&gt; options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x20756a68&quot;&gt;Sesame&lt;/a&gt;- and &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x1ec01ac0&quot;&gt;Jena&lt;/a&gt;-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).&lt;/p&gt;

&lt;p&gt;Now, this message could be better reflected in our material on the web. This &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x2027b410&quot;&gt;blog&lt;/a&gt; is a rather informal step in this direction; more is to come. For now we concentrate on delivering.&lt;/p&gt;

&lt;p&gt;The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.&lt;/p&gt;

&lt;p&gt;This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story, for example, turning the whole world into a database. This is why our examples start with things like turning the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x2051ff98&quot;&gt;TPC-H&lt;/a&gt; database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?&lt;/p&gt;

&lt;p&gt;The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?&lt;/p&gt;

&lt;h3&gt;&amp;quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&amp;quot;&lt;/h3&gt;

&lt;p&gt;We should answer in multiple  parts.&lt;/p&gt;

&lt;p&gt;For general collateral, like web sites and documentation:&lt;/p&gt;

&lt;p&gt;The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into&lt;/p&gt;

&lt;ul&gt;  
&lt;li&gt; Data web and RDF - Host linked data, expose relational assets as linked data;&lt;/li&gt;
&lt;li&gt; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;&lt;/li&gt;
&lt;li&gt; Web Services - access all the above over standard protocols, dynamic web pages, web hosting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each point, one simple statement.  We all know what the above things mean?&lt;/p&gt;

&lt;p&gt;Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or, in some cases, this many more requests per second. This too is clear.&lt;/p&gt;

&lt;p&gt;As far as I am concerned, hosting Java or .&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x1f297540&quot;&gt;NET&lt;/a&gt; does not have to be on the front page. Also, we have no great interest in going against &lt;a href=&quot;http://dbpedia.org/resource/Apache&quot; id=&quot;link-id0x1ea29578&quot;&gt;Apache&lt;/a&gt; when it comes to a web-server-only situation. The fact that we have a web listener is important for some things, but our claim to fame does not rest on this.&lt;/p&gt;

&lt;p&gt;Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.&lt;/p&gt;

&lt;p&gt;Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.&lt;/p&gt;

&lt;p&gt;I think that the communications failure in this case is that we want to talk about end-to-end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about the query cost model alone, because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to the cost model, as these are important to the whole, but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.&lt;/p&gt;

&lt;h3&gt;&amp;quot;Virtuoso is very big and very difficult&amp;quot;&lt;/h3&gt;

&lt;p&gt;One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.&lt;/p&gt;

&lt;p&gt;This gives you SQL and SPARQL out of the box.  Adding &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x20a2e7d0&quot;&gt;ODBC&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1e4cceb8&quot;&gt;JDBC&lt;/a&gt; clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation is required.&lt;/p&gt;

&lt;p&gt;Now for the difficult side:&lt;/p&gt;

&lt;p&gt;Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x20b03f08&quot;&gt;PHP&lt;/a&gt; or Java or anything else. For web services, simple is best. Stick to basics. &amp;quot;The engineer is one who can invent a simple thing.&amp;quot; Use SQL statements rather than admin UI.&lt;/p&gt;

&lt;p&gt;Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, as produced by the installers, is cluttered.&lt;/p&gt;

&lt;p&gt;We should put this into a couple of use case oriented how-tos.&lt;/p&gt;

&lt;p&gt;Also, we should create a network of &amp;quot;friendly local virtuoso geeks&amp;quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.&lt;/p&gt;

&lt;h3&gt;&amp;quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.&lt;/p&gt;

&lt;p&gt;If one really wants to do one&amp;#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.&lt;/p&gt;

&lt;p&gt;We are talking about such things with different parties at present.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How webby are you?  What is webby?&amp;quot;&lt;/h3&gt;

&lt;p&gt;&amp;quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&amp;quot;&lt;/p&gt;

&lt;p&gt;We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant &lt;i&gt;when&lt;/i&gt; it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.&lt;/p&gt;

&lt;p&gt;Google-style crawling of everything becomes less practical if one needs to run complex &lt;i&gt;ad hoc&lt;/i&gt; queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How does OpenLink see the next five years unfolding?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Personally, I think we have the basics for the birth of a new inflection in the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x2018bd98&quot;&gt;knowledge&lt;/a&gt; economy. The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1ec110d8&quot;&gt;URI&lt;/a&gt; is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1ecfba08&quot;&gt;information&lt;/a&gt; can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.&lt;/p&gt;

&lt;p&gt;The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is &lt;i&gt;context&lt;/i&gt;. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.&lt;/p&gt;

&lt;p&gt;Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies, dereferenceability, and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.&lt;/p&gt;

&lt;p&gt;Ever since the .com era, the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id0x1c4c9248&quot;&gt;URL&lt;/a&gt; has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.&lt;/p&gt;

&lt;p&gt;With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.&lt;/p&gt;
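
&lt;p&gt;As a minimal sketch of what packaging such a canned query could look like, the Python below turns a fixed SPARQL query into a GET URL using the &lt;code&gt;query&lt;/code&gt; parameter of the SPARQL protocol. The endpoint URL, the names &lt;code&gt;ENDPOINT&lt;/code&gt;, &lt;code&gt;TOP_CLASSES&lt;/code&gt;, and &lt;code&gt;canned_query_url&lt;/code&gt;, and the choice of query are assumptions made for illustration; the aggregate syntax is that of SPARQL 1.1, which a given end point may or may not support.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the URL of a real SPARQL end point.
ENDPOINT = "http://example.org/sparql"

# A canned report: the ten most populated classes in the data set.
# The aggregate (COUNT ... GROUP BY) is SPARQL 1.1 syntax.
TOP_CLASSES = """
SELECT ?type (COUNT(?s) AS ?n)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?n)
LIMIT 10
"""


def canned_query_url(endpoint, query):
    """Build a SPARQL protocol GET URL that carries the query text."""
    return endpoint + "?" + urlencode({"query": query.strip()})


if __name__ == "__main__":
    # The printed URL can be bookmarked, linked from a blog post, or fetched.
    print(canned_query_url(ENDPOINT, TOP_CLASSES))
```

&lt;p&gt;The point of the sketch is that the whole product is a URL: publishing it is the same act as publishing any other link.&lt;/p&gt;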

&lt;p&gt;For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x20791bf0&quot;&gt;Linked Open Data&lt;/a&gt; cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.&lt;/p&gt;

&lt;p&gt;This makes some things possible that were hard thus far.&lt;/p&gt;

&lt;p&gt;On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.&lt;/p&gt;

&lt;p&gt;Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x1f27e6d8&quot;&gt;Sponger&lt;/a&gt;, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.&lt;/p&gt;

&lt;p&gt;Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.&lt;/p&gt;

&lt;p&gt;Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event.  Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.&lt;/p&gt;</description></item><item><title>State of the Semantic Web, Part 1 - Sociology, Business, and Messaging (update 2)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-10-24#1459</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1459#comments</comments><pubDate>Fri, 24 Oct 2008 10:19:03 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-27T11:27:54-04:00</n0:modified><description>&lt;p&gt;I was in &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x28471870&quot;&gt;Vienna&lt;/a&gt; for the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x26f0ec28&quot;&gt;Linked Data&lt;/a&gt; Practitioners gathering this week. Danny Ayers asked me if I would &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x26cf7678&quot;&gt;blog&lt;/a&gt; about the State of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x273087e0&quot;&gt;Semantic Web&lt;/a&gt; or write the &lt;i&gt;This Week&amp;#39;s Semantic Web&lt;/i&gt; column. I don&amp;#39;t have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x264e11b8&quot;&gt;Virtuoso&lt;/a&gt; will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft.&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me&quot; id=&quot;link-id0x2787de70&quot;&gt;Sören Auer&lt;/a&gt; asked me to say a few things about relational to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x280b12f8&quot;&gt;RDF&lt;/a&gt; mapping. I will cite some highlights from this, as they pertain to the general scene. There was an &amp;quot;open hacking&amp;quot; session Wednesday night featuring lightning talks. I will use some of these too as a starting point.&lt;/p&gt;
&lt;h3&gt;The messaging?&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.w3.org/2001/sw/sweo/&quot; id=&quot;link-id0x28078030&quot;&gt;SWEO&lt;/a&gt; (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became &amp;quot;&lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x290a48c0&quot;&gt;Data&lt;/a&gt; Unleashed.&amp;quot; Why not? Let&amp;#39;s give this some context.&lt;/p&gt;
&lt;p&gt;So, if we are holding a &lt;i&gt;Semantic Web 101&lt;/i&gt; session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.&lt;/p&gt;
&lt;p&gt;Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with &lt;a href=&quot;http://myopenlink.net/&quot; id=&quot;link-id0x272ed6d0&quot;&gt;myopenlink&lt;/a&gt;.&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x277dbbd0&quot;&gt;net&lt;/a&gt; and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x25fda5c8&quot;&gt;Billion Triples Challenge&lt;/a&gt; demo. We will make a webcast about this to make it all clearer.&lt;/p&gt;
&lt;p&gt;Behold: The Data Web is about the world becoming a database; writing &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x278c3878&quot;&gt;SPARQL&lt;/a&gt; queries or triples is incidental. You will write FOAF files by hand just as little as you now write &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x27e6be18&quot;&gt;SQL&lt;/a&gt; insert statements for filling in your account &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x2727a278&quot;&gt;information&lt;/a&gt; on Myspace.&lt;/p&gt;
&lt;p&gt;Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x26020128&quot;&gt;RDBMS&lt;/a&gt; was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more &lt;i&gt;ad hoc&lt;/i&gt;. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.&lt;/p&gt;
&lt;p&gt;Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?&lt;/p&gt;
&lt;p&gt;A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.&lt;/p&gt;
&lt;p&gt;However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.&lt;/p&gt;
&lt;h3&gt;What is the business value of this?&lt;/h3&gt;
&lt;p&gt;The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because business loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web.&lt;/p&gt;
&lt;p&gt;Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data.&lt;/p&gt;
&lt;p&gt;Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x261c7bc8&quot;&gt;linked open data&lt;/a&gt;. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.&lt;/p&gt;
&lt;p&gt;We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.&lt;/p&gt;
&lt;p&gt;But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global information driven economy?&lt;/p&gt;
&lt;p&gt;The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information.&lt;/p&gt;
&lt;h3&gt;But let&amp;#39;s get back to more specific issues&lt;/h3&gt;
&lt;p&gt;At the lightning talks in Vienna, one participant said, &amp;quot;Man&amp;#39;s enemy is not the lion that eats men, it&amp;#39;s his own brother. Semantic Web&amp;#39;s enemy is the &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x26273118&quot;&gt;XML&lt;/a&gt; Web services stack that ate its lunch.&amp;quot; There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.&lt;/p&gt;
&lt;p&gt;SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the &lt;i&gt;ad hoc&lt;/i&gt; component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit.&lt;/p&gt;
&lt;p&gt;&amp;quot;I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it&amp;#39;s too complex, the developer who says it is not what he&amp;#39;s used to, and so on?&amp;quot;&lt;/p&gt;
&lt;p&gt;This is an education question. One of the findings of SWEO&amp;#39;s enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF based technologies offer to make these more open standards based. Open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: Experts can&amp;#39;t be found even when they are in the next department, or reports and analyses get buried in wikis, spreadsheets, and emails.&lt;/p&gt;
&lt;p&gt;Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x260172a8&quot;&gt;knowledge&lt;/a&gt; bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.&lt;/p&gt;</description></item><item><title>BSBM With Triples and Mapped Relational Data</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1409#comments</comments><pubDate>Wed, 06 Aug 2008 19:35:27 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-08-06T16:29:40-04:00</n0:modified><description>&lt;p&gt;The special contribution of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id10039db0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id106b2538&quot;&gt;BSBM&lt;/a&gt;) to the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id101a75f8&quot;&gt;RDF&lt;/a&gt; world is to raise the question of doing OLTP with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xb230eb0&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0xa832da8&quot;&gt;BSBM&lt;/a&gt; also specifies a relational schema and can generate the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id1206c378&quot;&gt;data&lt;/a&gt; as either triples or &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1667f040&quot;&gt;SQL&lt;/a&gt; inserts.&lt;/p&gt;

&lt;p&gt;The benchmark effectively simulates the case of exposing an existing &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id10a93518&quot;&gt;RDBMS&lt;/a&gt; as RDF.  &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id13e46d80&quot;&gt;OpenLink Software&lt;/a&gt; calls this &lt;i&gt;RDF Views&lt;/i&gt;.  &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id12027578&quot;&gt;Oracle&lt;/a&gt; is beginning to call this &lt;i&gt;semantic covers&lt;/i&gt;.  The &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id161dc678&quot;&gt;RDB2RDF XG&lt;/a&gt;, a W3C incubator group, has been active in this area since Spring, 2008.&lt;/p&gt;

&lt;h3&gt;But why an OLTP workload with RDF to begin with?&lt;/h3&gt;

&lt;p&gt;We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS.  If &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xabe48a0&quot;&gt;data&lt;/a&gt; is online for human consumption, it may be online via a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id106a8908&quot;&gt;SPARQL&lt;/a&gt; end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.&lt;/p&gt;

&lt;p&gt;Warehousing all the world&amp;#39;s publishable data as RDF is not our first preference, nor would it be the publisher&amp;#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&amp;#39;ll do here.&lt;/p&gt;

&lt;h3&gt;What We Got &lt;/h3&gt;

&lt;p&gt;First, we found that &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400&quot; id=&quot;link-id150ea748&quot;&gt;making the query plan took much too long&lt;/a&gt; in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.&lt;/p&gt;

&lt;p&gt;But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is as good as unbeatable with the query mix.  And when there is a clear mapping, there is no reason the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x96bb5e0&quot;&gt;SPARQL&lt;/a&gt; could not be directly translated.&lt;/p&gt;

&lt;p&gt;If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!&lt;/p&gt;

&lt;p&gt;We filled two &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id12dbdc70&quot;&gt;Virtuoso&lt;/a&gt; instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &amp;quot;query mixes per hour&amp;quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)&lt;/p&gt;

&lt;p&gt;With the unmodified benchmark we got:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
    &lt;td&gt;1297 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
   &lt;td&gt;&lt;b&gt;3144 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)&lt;/p&gt;

&lt;p&gt;The following were measured on the second run of a 100 query mix series, single test driver, warm cache.&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
    &lt;td&gt; 5746 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
   &lt;td&gt; &lt;b&gt;7525 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 (4 drivers running 100 query mixes each) divided by the longest run time in hours.&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
    &lt;td&gt; 19459 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
   &lt;td&gt; &lt;b&gt;24531 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.  The concurrent throughputs are about 3.3 to 3.4 times the single-driver throughput, short of the ideal factor of 4, which is normal for SMP due to memory contention.  The numbers do not show significant overhead from thread synchronization.&lt;/p&gt;

&lt;p&gt;The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher.  We used the &lt;code&gt;StopCompilerWhenXOverRunTime = 1&lt;/code&gt; option here to cut needless compiler overhead, the queries being straightforward enough.&lt;/p&gt;
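
&lt;p&gt;As a rough check on the figures above, the Python sketch below recomputes the ratios from the reported qmph values with the text index in place. The function names are illustrative and not part of the BSBM test driver. The compile-free estimate assumes a CPU-bound server in which compilation cost disappears entirely and all other work is unchanged, so it is an upper bound rather than a measurement.&lt;/p&gt;

```python
# Reported throughputs in query mixes per hour (qmph), text index in place.
SINGLE = {"physical": 5746, "mapped": 7525}
CONCURRENT_4 = {"physical": 19459, "mapped": 24531}


def qmph(query_mixes, longest_run_seconds):
    """Query mixes per hour for a run.

    Assumes the run time is given in seconds; the post only states that
    the figure is the number of query mixes over the longest run time.
    """
    return query_mixes * 3600.0 / longest_run_seconds


def scaling(store):
    """Ratio of 4-driver throughput to single-driver throughput."""
    return CONCURRENT_4[store] / SINGLE[store]


def mapping_advantage(results):
    """Ratio of mapped throughput to physical-triple throughput."""
    return results["mapped"] / results["physical"]


def speedup_without_compile(compile_share):
    """Upper bound on throughput gain if compilation cost vanished."""
    return 1.0 / (1.0 - compile_share)


if __name__ == "__main__":
    for store in ("physical", "mapped"):
        print(store, "scaling with 4 drivers:", round(scaling(store), 2))
    print("mapping advantage, 1 driver:", round(mapping_advantage(SINGLE), 2))
    print("mapping advantage, 4 drivers:", round(mapping_advantage(CONCURRENT_4), 2))
    print("bound without compilation:", round(speedup_without_compile(1 / 3), 2))
```

&lt;p&gt;Under these assumptions, removing a compilation share of one third would raise throughput by at most a factor of 1.5.&lt;/p&gt;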

&lt;p&gt;We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.&lt;/p&gt;

&lt;h3&gt;Suggestions for BSBM&lt;/h3&gt;

&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Reporting Rules.&lt;/b&gt; The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.&lt;/p&gt;
 &lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Multiuser operation.&lt;/b&gt;  The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Add business intelligence.&lt;/b&gt;  SPARQL has aggregates now, at least with &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id11a25ac0&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xa83f490&quot;&gt;Virtuoso&lt;/a&gt;, so let&amp;#39;s use these.  The BSBM business intelligence metric should be a separate metric off the same data.  Adding synthetic sales figures would make more interesting queries possible.  For example, producing recommendations like &amp;quot;customers who bought this also bought xxx.&amp;quot;&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;For the SPARQL community&lt;/b&gt;, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id109e2448&quot;&gt;SPARQL protocol&lt;/a&gt; extension; the SPARUL syntax should also have a way of calling a procedure.  Something like &lt;code&gt;select proc (??, ??)&lt;/code&gt; would be enough, where &lt;code&gt;??&lt;/code&gt; is a parameter marker, like &lt;code&gt;?&lt;/code&gt; in &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id13febf48&quot;&gt;ODBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id120416a8&quot;&gt;JDBC&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Add transactions.&lt;/b&gt;  Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry; the SUT could implement these as updates to the triples or to a mapped relational store.  This could use stored procedures or logic in an app server.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
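
&lt;p&gt;The multiuser suggestion above can be sketched as follows: each client derives its own random generator from its stream number, so streams differ from one another while each stream stays reproducible between runs. This is a minimal illustration, not the BSBM test driver. The query identifiers, the function name and the seeding scheme are all assumptions made for the example.&lt;/p&gt;

```python
import random

# Hypothetical query mix; the actual BSBM mix and its parameters differ.
QUERY_IDS = ["Q1", "Q2", "Q3", "Q4", "Q5"]


def query_sequence(stream_number, length, base_seed=0):
    """Return a reproducible query sequence specific to one stream."""
    # One generator per stream, seeded from the stream number, so that
    # clients do not all replay the same sequence.
    rng = random.Random(base_seed * 1000003 + stream_number)
    return [rng.choice(QUERY_IDS) for _ in range(length)]


if __name__ == "__main__":
    for stream in range(3):
        print(stream, query_sequence(stream, 8))
```

&lt;p&gt;Rerunning a given stream number replays the same sequence, which keeps results comparable across runs while still giving the server a mixed, parallel workload.&lt;/p&gt;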

&lt;h3&gt;Comments on Query Mix&lt;/h3&gt;

&lt;p&gt;The execution time of most queries grows less than linearly with the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.&lt;/p&gt;

&lt;h2&gt;Next&lt;/h2&gt;

&lt;p&gt;We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  This also includes all the query optimization work done for BSBM.  This will be available in the coming days.&lt;/p&gt;</description></item><item><title>ESWC 2008</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1374</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1374#comments</comments><pubDate>Mon, 09 Jun 2008 13:49:15 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-06-11T13:15:11.000008-04:00</n0:modified><description>&lt;p&gt;Yrjänä Rankka and I attended &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id10b7a038&quot;&gt;ESWC2008&lt;/a&gt; on behalf of OpenLink.&lt;/p&gt;
&lt;p&gt;We were invited at the last minute to give a &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id105df758&quot;&gt;Linked Open Data&lt;/a&gt; talk at Paolo Bouquet&amp;#39;s Identity and Reference workshop. We also had a demo of &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id12eacca0&quot;&gt;SPARQL&lt;/a&gt; BI (&lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/ESWC2008%20SPARQL%20BI%20OpenLink.ppt&quot; id=&quot;link-id10b43e58&quot;&gt;PPT&lt;/a&gt;; &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations&quot; id=&quot;link-id1116d8f0&quot;&gt;other formats coming soon&lt;/a&gt;), our business intelligence extensions to &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1843a368&quot;&gt;SPARQL&lt;/a&gt;, as well as joining between relational &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id10badc40&quot;&gt;data&lt;/a&gt; mapped to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id108edaf8&quot;&gt;RDF&lt;/a&gt; and native &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1843a3b0&quot;&gt;RDF&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1843a3c8&quot;&gt;data&lt;/a&gt;. I also spoke at the social networks panel chaired by Harry Halpin.&lt;/p&gt;
&lt;p&gt;I have gathered a few impressions that I will share in the next few posts (&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1375&quot; id=&quot;link-id107298e0&quot;&gt;1 - RDF Mapping&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1376&quot; id=&quot;link-id10b3a530&quot;&gt;2 - DARQ&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-id107290e0&quot;&gt;3 - voiD&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1378&quot; id=&quot;link-id1071a950&quot;&gt;4 - Paradigmata&lt;/a&gt;). &lt;i&gt;Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.&lt;/i&gt;
&lt;/p&gt;
&lt;h2&gt;The voiD Graph&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x16c781e0&quot;&gt;Linked Open Data&lt;/a&gt; was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id108f3c48&quot;&gt;LOD&lt;/a&gt; this time around is &lt;a href=&quot;http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08&quot; id=&quot;link-id10df9830&quot;&gt;voiD, the Vocabulary of Interlinked Datasets&lt;/a&gt;, a means to describe what in fact is inside the &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x16c78228&quot;&gt;LOD&lt;/a&gt; cloud, how to join it with what and so forth. Big time important if there is to be a &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-iddf74578&quot;&gt;web of federatable data sources&lt;/a&gt;, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of &lt;a href=&quot;http://www.linkeddataplanet.com/&quot; id=&quot;link-id10dd0848&quot;&gt;Linked Data Planet&lt;/a&gt; in a couple of weeks.&lt;/p&gt;
&lt;h2&gt;Federating&lt;/h2&gt;
&lt;p&gt;Bastian Quilitz gave a talk about his &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id108746e8&quot;&gt;DARQ&lt;/a&gt;, a federated version of Jena&amp;#39;s ARQ.&lt;/p&gt;
&lt;p&gt;Something like &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id0x16c782e8&quot;&gt;DARQ&lt;/a&gt;&amp;#39;s optimization statistics should make their way into the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id10992348&quot;&gt;SPARQL protocol&lt;/a&gt; as well as the voiD data set description.&lt;/p&gt;
&lt;p&gt;We really need federation but more on this in &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1376&quot; id=&quot;link-id1059d688&quot;&gt;a separate post&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
&lt;a href=&quot;http://xsparql.deri.ie/&quot; id=&quot;link-id10314308&quot;&gt;XSPARQL&lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Axel Polleres et al. had a paper about &lt;a href=&quot;http://xsparql.deri.ie/&quot; id=&quot;link-id0x1a2d8458&quot;&gt;XSPARQL&lt;/a&gt;, a merge of &lt;a href=&quot;http://dbpedia.org/resource/XQuery&quot; id=&quot;link-id10b98e90&quot;&gt;XQuery&lt;/a&gt; and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x1d99c1d0&quot;&gt;SPARQL protocol&lt;/a&gt; for this to be practical. We could do this. We&amp;#39;ll have to see when.&lt;/p&gt;
&lt;p&gt;Politically, using &lt;a href=&quot;http://dbpedia.org/resource/XQuery&quot; id=&quot;link-id0x1acae1f0&quot;&gt;XQuery&lt;/a&gt; to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.&lt;/p&gt;
&lt;h2&gt;Panel: Will the Sem Web Rise to the Challenge of the Social Web?&lt;/h2&gt;
&lt;p&gt;The social web panel presented the question of whether the sem web was ready for prime time with data portability.&lt;/p&gt;
&lt;p&gt;The main thrust was expressed in Harry Halpin&amp;#39;s rousing closing words: &amp;quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id122f4da0&quot;&gt;semantic web&lt;/a&gt;, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&amp;quot;&lt;/p&gt;
&lt;p&gt;So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as &lt;a href=&quot;http://dbpedia.org/page/OAuth&quot; id=&quot;link-idf300bc0&quot;&gt;OAuth&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/page/OpenID&quot; id=&quot;link-id10ce7a40&quot;&gt;OpenID&lt;/a&gt;. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &amp;quot;outdated pace&amp;quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.&lt;/p&gt;
&lt;p&gt;I said that triples and vocabularies existed, that OpenLink had &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id1210dbf8&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id11076be8&quot;&gt;OpenLink Data Spaces&lt;/a&gt;, &lt;a href=&quot;http://community.linkeddata.org/&quot; id=&quot;link-id10d46710&quot;&gt;Community LinkedData&lt;/a&gt;) for managing one&amp;#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&amp;#39;s prices, without replication. Count 16G RAM and a few cores per Gtriple so that one is not waiting for disk all the time.&lt;/p&gt;
&lt;p&gt;The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.&lt;/p&gt;
&lt;p&gt;So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!&lt;/p&gt;
&lt;p&gt;As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id108b7688&quot;&gt;information&lt;/a&gt; overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.&lt;/p&gt;
&lt;h2&gt;Identity and Reference&lt;/h2&gt;
&lt;p&gt;For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by &lt;i&gt;nominibus suis&lt;/i&gt;, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.&lt;/p&gt;
&lt;p&gt;For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.&lt;/p&gt;
&lt;p&gt;Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an &lt;a href=&quot;http://dbpedia.org/resource/Open_world_assumption&quot; id=&quot;link-id10c4dfd0&quot;&gt;open world&lt;/a&gt;. For an app-specific warehouse, sameAs can be resolved at load time.&lt;/p&gt;
&lt;p&gt;There was naturally some apparent tension between the Occam camp of &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id105fd240&quot;&gt;entity&lt;/a&gt; name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.&lt;/p&gt;
&lt;h2&gt;CYC and NLP&lt;/h2&gt;
&lt;p&gt;The semantic search workshop was interesting, especially CYC&amp;#39;s presentation. CYC is, as it were, the grand old man of &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id10568158&quot;&gt;knowledge&lt;/a&gt; representation. Over the long term, I would like to have support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x17f7dd40&quot;&gt;knowledge&lt;/a&gt; base for helping in search type queries. If it is for transactions or financial reporting, then queries will be &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id130a0a80&quot;&gt;SQL&lt;/a&gt; and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&amp;#39;ll see.&lt;/p&gt;
&lt;p&gt;I will also have a closer look at natural language one of these days, quite inevitably, since &lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id10795828&quot;&gt;Zitgist&lt;/a&gt; (for example) is into &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x1a2c8bd0&quot;&gt;entity&lt;/a&gt; disambiguation.&lt;/p&gt;
&lt;h2&gt;Scale&lt;/h2&gt;
&lt;p&gt;Garlic gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.&lt;/p&gt;
&lt;p&gt;Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and &lt;a href=&quot;http://sindice.org/&quot; id=&quot;link-id0x1b383720&quot;&gt;Sindice&lt;/a&gt;/&lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id0x1b383738&quot;&gt;Zitgist&lt;/a&gt; as triples.&lt;/p&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.&lt;/p&gt;
&lt;p&gt;For the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id1026ac28&quot;&gt;Virtuoso&lt;/a&gt; roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At &lt;a href=&quot;http://www.linkeddataplanet.com/&quot; id=&quot;link-id0x1a2c7eb0&quot;&gt;Linked Data Planet&lt;/a&gt;, we will make more announcements.&lt;/p&gt;</description></item><item><title>More RDF scalability tests</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2006-11-01#1074</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1074#comments</comments><pubDate>Wed, 01 Nov 2006 19:26:40 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:53:18-04:00</n0:modified><description>&lt;p&gt;We have lately been busy with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x17524ab8&quot;&gt;RDF&lt;/a&gt; scalability. We work with the 8000 university LUBM &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xd4ba910&quot;&gt;data&lt;/a&gt; set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.&lt;/p&gt;
&lt;p&gt;The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See &lt;a href=&quot;http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSBitmapIndexing&quot; id=&quot;link-id105e5cf8&quot;&gt;Advances in Virtuoso RDF Triple Storage&lt;/a&gt; for an in-depth treatment of the subject.&lt;/p&gt;
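&lt;p&gt;The figures above can be cross-checked with a little arithmetic. The sketch below assumes that 75G means 75 * 10**9 bytes, which the post does not specify, and the function names are made up for the example. Under that assumption, 74 bytes per triple implies roughly a billion triples, and the 23h 46m load time works out to an average on the order of 12,000 triples per second.&lt;/p&gt;

```python
def triples_from_size(db_bytes, bytes_per_triple):
    """Estimate the triple count implied by database size and density."""
    return db_bytes / bytes_per_triple


def load_rate(triples, hours, minutes):
    """Average load rate in triples per second."""
    seconds = hours * 3600 + minutes * 60
    return triples / seconds


if __name__ == "__main__":
    # Assumption: "75G" is read as 75 * 10**9 bytes (decimal gigabytes).
    triples = triples_from_size(75 * 10**9, 74)
    print(f"implied triples: {triples / 1e9:.2f} billion")
    print(f"average load rate: {load_rate(triples, 23, 46):,.0f} triples/s")
```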
&lt;p&gt;The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we&amp;#39;ll look at dropping a couple of critical sections from index random access. Also, some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still, we should not waste too much time on platform specifics. We&amp;#39;ll see.&lt;/p&gt;
&lt;p&gt;We just updated the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x189d64b8&quot;&gt;Virtuoso&lt;/a&gt; Open Source cut. The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.&lt;/p&gt;
&lt;p&gt;We are also now applying the relational to RDF mapping discussed in &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VOSSQLRDF&quot; id=&quot;link-id10677bb8&quot;&gt;Declarative SQL Schema to RDF Ontology Mapping&lt;/a&gt; to the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0xa0f5fde0&quot;&gt;ODS&lt;/a&gt; applications.&lt;/p&gt;
&lt;p&gt;There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.&lt;/p&gt;
&lt;p&gt;We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the cost model that there is now. A more systematic review of the cost model for &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x19b96630&quot;&gt;SPARQL&lt;/a&gt; will come when we get to the queries.&lt;/p&gt;
&lt;p&gt;We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x19bbd098&quot;&gt;SQL&lt;/a&gt; engine used for running SPARQL, we probably can do these as unions at run time with good performance and a better working set due to not storing trivial entailed triples. Some more thought and experimentation need to go into this.&lt;/p&gt;
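&lt;p&gt;To make the trade-off concrete, here is a toy sketch of answering a class-membership query as a run-time union over subclasses, without storing any entailed triples. It is an illustration of the general idea only, not Virtuoso code. The data, the class names and the function names are invented for the example.&lt;/p&gt;

```python
# Toy triple store: only explicitly asserted triples are kept.
TRIPLES = {
    ("alice", "rdf:type", "GraduateStudent"),
    ("bob", "rdf:type", "Professor"),
    ("carol", "rdf:type", "Student"),
}

# Class hierarchy: subclass -> direct superclass (names are illustrative).
SUBCLASS_OF = {
    "GraduateStudent": "Student",
    "Student": "Person",
    "Professor": "Person",
}


def subclasses_of(cls):
    """Return cls plus every class that is transitively a subclass of it."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result


def instances_of(cls):
    """Answer 'x rdf:type cls' as a union over cls and its subclasses."""
    wanted = subclasses_of(cls)
    return {s for (s, p, o) in TRIPLES if p == "rdf:type" and o in wanted}


if __name__ == "__main__":
    print(sorted(instances_of("Person")))
    print(sorted(instances_of("Student")))
```

&lt;p&gt;Forward chaining would instead add a Person triple for every student and professor at load time. The run-time version stores three triples here, not seven, at the price of expanding each type pattern when the query runs.&lt;/p&gt;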
</description></item><item><title>More Thoughts on ORDBMS Clients, .NET and RDF</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2006-07-17#1007</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1007#comments</comments><pubDate>Mon, 17 Jul 2006 11:47:30 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:10-04:00</n0:modified><description>&lt;p&gt;Continuing on from &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1002&quot; id=&quot;link-id1064f0c8&quot;&gt;the previous post&lt;/a&gt;... If Microsoft opens the right interfaces for independent developers, we see many exciting possibilities for using &lt;a href=&quot;http://msdn2.microsoft.com/en-us/data/aa937699.aspx&quot; id=&quot;link-id10f3ab60&quot;&gt;ADO.NET&lt;/a&gt; 3.0 with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x98d60b0&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Microsoft quite explicitly states that their thrust is to decouple the client side representation of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x175112a8&quot;&gt;data&lt;/a&gt; as .NET objects from the relational schema on the database. This is a worthy goal.&lt;/p&gt;
&lt;p&gt;But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: towards object-oriented databases (OODBMS) and towards making applications for the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0xdbba5b0&quot;&gt;semantic web&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going as it were in the other direction from Microsoft&amp;#39;s intended decoupling. For example, we could do typical OODBMS tricks such as pre-fetch of objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together and if all objects are modeled as subclasses (sub-tables) of a common superclass, then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in very navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.&lt;/p&gt;
&lt;p&gt;But what is more interesting and more topical in the present times is making clients for the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xe2c1e68&quot;&gt;RDF&lt;/a&gt; world. There, the OWL ontology could be used to make the .NET classes and the DBMS could, when returning URIs serving as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as &amp;quot;proxies&amp;quot; of these RDF objects. Of course, only predicates for which the client has a representation are relevant, thus some client-server handshake is needed at the start. What data could be pre-fetched is like the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type is not known or inferable could be left out or represented as a special class with name-value pairs for its attributes; the same goes for blank nodes.&lt;/p&gt;
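&lt;p&gt;The mapping described above can be sketched in a few lines. The example below uses Python in place of .NET and hand-written classes where generated ones would be used, purely to show the shape of the idea: typed subjects become instances of known classes, IRI-valued predicates become object references, multi-valued predicates become lists, and untyped subjects fall back to a generic name-value holder. All class, function and prefix names here are assumptions made for the illustration.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Person:
    """Client-side class, assumed to be generated from an ontology."""
    iri: str
    name: str = ""
    # Multi-valued predicate; kept out of repr since references may cycle.
    knows: list = field(default_factory=list, repr=False)


@dataclass(eq=False)
class GenericResource:
    """Fallback for subjects whose type the client has no class for."""
    iri: str
    attributes: dict = field(default_factory=dict)


# RDF types the client has a representation for.
CLASSES = {"foaf:Person": Person}


def materialize(triples):
    """Turn (subject, predicate, object) triples into linked proxies."""
    types = {s: o for (s, p, o) in triples if p == "rdf:type"}
    proxies = {}
    for s in {s for (s, _, _) in triples}:
        cls = CLASSES.get(types.get(s))
        proxies[s] = cls(iri=s) if cls else GenericResource(iri=s)
    for s, p, o in triples:
        obj = proxies[s]
        if p == "rdf:type":
            continue
        if isinstance(obj, Person) and p == "foaf:name":
            obj.name = o
        elif isinstance(obj, Person) and p == "foaf:knows":
            # An IRI-valued object becomes a pointer to another proxy.
            obj.knows.append(proxies.get(o, o))
        elif isinstance(obj, GenericResource):
            obj.attributes.setdefault(p, []).append(o)
        # Predicates the client has no representation for are skipped.
    return proxies


if __name__ == "__main__":
    data = [
        ("ex:alice", "rdf:type", "foaf:Person"),
        ("ex:alice", "foaf:name", "Alice"),
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:bob", "rdf:type", "foaf:Person"),
        ("ex:bob", "foaf:name", "Bob"),
        ("ex:doc", "dc:title", "Untyped thing"),
    ]
    objs = materialize(data)
    print(objs["ex:alice"].name, [p.name for p in objs["ex:alice"].knows])
```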
&lt;p&gt;In this way, .NET&amp;#39;s considerable UI capabilities could directly be exploited for visualizing RDF data, only given that the data complies reasonably well with a known ontology.&lt;/p&gt;
&lt;p&gt;If a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x16b86e90&quot;&gt;SPARQL&lt;/a&gt; query returned a result-set, IRI type columns would be returned as .NET instances and the server would pre-fetch enough data for filling them in. For a CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x19a26180&quot;&gt;Entity&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1d8ea998&quot;&gt;SQL&lt;/a&gt; string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL type queries, though.&lt;/p&gt;
&lt;p&gt;Many of these questions will be better answerable as we get more details on Microsoft&amp;#39;s forthcoming &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0xde74a60&quot;&gt;ADO&lt;/a&gt; .NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.&lt;/p&gt;</description></item><item><title>Object Relational Rediscovered?</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2006-07-13#1002</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1002#comments</comments><pubDate>Thu, 13 Jul 2006 11:38:41 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:08.000001-04:00</n0:modified><description>&lt;p&gt;I have recently read some of Microsoft&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x151feed0&quot;&gt;ADO&lt;/a&gt; .NET 3 papers. I am reminded of the distant past when I designed Kubl, which later became OpenLink &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xd1e5a50&quot;&gt;Virtuoso&lt;/a&gt;. So I will reminisce and speculate a little.&lt;/p&gt;
&lt;p&gt;So now is the time when polymorphic queries and mixing relational style joins and object style navigation become politically acceptable and even recommended and there finally is a workable solution to having a foreign key in the database and a pointer or set of pointers in the client application. Not to mention change tracking so as to be able to update in-memory &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xd6877c0&quot;&gt;data&lt;/a&gt; structures and commit a delta against the database without explicit update statements.&lt;/p&gt;
&lt;p&gt;All these questions existed already in the mid 90s and earlier. Since I was coming from OO and LISP into the database world, I even felt these questions to be important. The solution in the earliest Kubl was to have inheritance between tables, what became the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xc703880&quot;&gt;SQL&lt;/a&gt; 2K &lt;code&gt;UNDER&lt;/code&gt; clause, and a virtual column called &lt;code&gt;_ROW&lt;/code&gt; that would select a serialization of the primary key entry. Then there was the function &lt;code&gt;row_key()&lt;/code&gt;, which when applied to a &lt;code&gt;_ROW&lt;/code&gt; virtual column would return a database-wide unique identifier of the row, containing the key info and the key part values plus which subtable of the table was at hand. Then there was a function for dereferencing a &lt;code&gt;row_key&lt;/code&gt; for getting the &lt;code&gt;_ROW&lt;/code&gt;. And one could store &lt;code&gt;row_keys&lt;/code&gt; into columns and dereference these in queries. Within SQL, one could use the &lt;code&gt;row_column&lt;/code&gt; function to extract individual column values from a &lt;code&gt;row_key&lt;/code&gt; or &lt;code&gt;_ROW&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This was all fine server side. But we also had a client for Franz Inc.&amp;#39;s Allegro Common Lisp that talked to Kubl&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0xdbceda0&quot;&gt;ODBC&lt;/a&gt; listener. This client had the basic statements and prepared statements and result sets, parameters and array parameters, a little like &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x181a27b0&quot;&gt;JDBC&lt;/a&gt; does now. But the extra was that we could do a mapping between a Lisp struct or object and a database key, so the &lt;code&gt;_ROW&lt;/code&gt; would automatically materialize into the Lisp struct or class instance. And the mapping between these materializations and the &lt;code&gt;row_keys&lt;/code&gt; identifying them in the database were kept in a thread environment called object space. Updates could be relational-style &lt;code&gt;UPDATEs&lt;/code&gt; or consist of putting a &lt;code&gt;_ROW&lt;/code&gt; serialization in database format back into the Kubl store with a single SQL function.&lt;/p&gt;
&lt;p&gt;This was different from just storing object serializations into LOB columns, as is often done, insofar as the object classes and data members were really database tables and columns, thus native to the DBMS, not just opaque data to be processed client-side only.&lt;/p&gt;
&lt;p&gt;So it was then possible to program a little like what is shown in the ADO .NET 3 demos today, some ten years later.&lt;/p&gt;
&lt;p&gt;Some of these functions still exist in Virtuoso, albeit in a deprecated state, and there is no client that can use these to any advantage. Indeed, we dropped this line of work when Kubl became Virtuoso, mostly because there was no standard and no client applications that would use such features. Instead, we concentrated on virtual &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0xd06fd40&quot;&gt;RDBMS&lt;/a&gt;, transparently accessing any third party data via ODBC.&lt;/p&gt;
&lt;p&gt;Now however, as objects, both native SQL and Java and .NET, have become mainstream citizens of relational databases in general, Virtuoso and otherwise, and as Microsoft has legitimized accessing whole objects and not only scalar columns in result sets as part of ADO .NET 3, these things might be worth a second look.&lt;/p&gt;</description></item>
</channel>
</rss>
