<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/">
  <rss:title>OpenLink Virtuoso (Product Blog)</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/</rss:link>
  <rss:description>A great place to track Virtuoso&#39;s rapid evolution.</rss:description>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-07-27T09:40:07Z</dc:date>
  <dc:rights xmlns:dc="http://purl.org/dc/elements/1.1/">OpenLink Software 1998-2006</dc:rights>
  <dc:language xmlns:dc="http://purl.org/dc/elements/1.1/">en-us</dc:language>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1393" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1383" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1382" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1381" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1380" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1379" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1369" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1359" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1354" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1350" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1349" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1348" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1340" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1339" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?id=1338" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1393">
  <rss:title>Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1393</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1393</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1393</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-07-17T17:18:09Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs Improvements Full operation with Jena and Sesame RDF Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites. Better support for alternate RDF indexing schemes Parallel operation of the RDF Sponger, importing multiple sources concurrently. New data formats supported for on-demand RDF-ization in the Sponger More efficient support for inference of subclass and sub-property; now capable of efficiently handling taxonomies of tens of thousands of classes OWL equivalentClass and equivalentProperty support. Dynamic IRI host part support for mapped data and for metadata of local resources. Renaming the host or using multiple virtual hosts will accept URIs with the right host part and refer to the same thing, no duplicate storage required. SPARQL optimizations for LIMIT and OFFSET Documentation How to read query plans and how to use the key performance meters How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case How to debug RDF views Better documentation of SPARQL extensions and options A sample of correct RDF view usage with the Northwind demo data Bug Fixes Generally improved safety of built-in functions, better argument checking. Verified UTF8 international character support in all RDF use cases, SQL client/SPARQL protocol/all data formats.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs</div>
<h2>Improvements</h2>
<ul>
<li>
  <a href="http://docs.openlinksw.com:80/virtuoso/rdfnativestorageproviders.html" id="link-id13e54d98">Full operation</a> with <a href="http://jena.sourceforge.net/" id="link-id0x11a3d360">Jena</a> and <a href="http://sourceforge.net/projects/sesame/" id="link-id0x1108d428">Sesame</a> <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1288aa00">RDF</a> Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites.</li>
<li>Better support for alternate RDF indexing schemes</li>
<li>Parallel operation of the RDF Sponger, importing multiple
sources concurrently.</li>
<li>New <a href="http://dbpedia.org/resource/Data" id="link-id0x128a9810">data</a> formats supported for on-demand RDF-ization in the
Sponger</li>
<li>More efficient support for inference of subclass and
sub-property; now capable of efficiently handling taxonomies of tens
of thousands of classes</li>
<li>
    <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x6af0678">OWL</a> <a href="http://docs.openlinksw.com:80/virtuoso/rdfsparqlrule.html#rdfsparqlruleintro" id="link-id104d58d8">equivalentClass and equivalentProperty</a> support.</li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfdatarepresentation.html#rdfdynamiclocal" id="link-id109606a8">Dynamic IRI host part</a> support for mapped data and for metadata of local resources. Renaming the host or using multiple virtual hosts will accept URIs with the right host part and refer to the same thing, no duplicate storage required.</li>
<li>
    <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x12e0cc38">SPARQL</a> optimizations for <code>LIMIT</code> and <code>OFFSET</code>
</li>
</ul>
<h2>Documentation</h2>
<ul>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/perfdiag.html#perfdiagqueryplans" id="link-id10a56dd0">How to read query plans and how to use the key performance meters</a>
  </li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfperformancetuning.html#rdfperfcost" id="link-id106cb5c0">How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case</a>
  </li>
<li>How to debug RDF views
<ul>
  <li>
    <a href="http://docs.openlinksw.com:80/virtuoso/sparqldebug.html" id="link-id133b4420">Better documentation of SPARQL extensions and options</a>
  </li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfviews.html#rdfviewnorthwindexample1" id="link-id1060fdd8">A sample of correct RDF view usage with the Northwind demo data</a>
  </li>
</ul>
</li>
</ul>
<h2>Bug Fixes</h2>
<ul>
<li>Generally improved safety of built-in functions, better
argument checking.</li>
<li>Verified UTF8 international character support in all RDF use
cases, <a href="http://dbpedia.org/resource/SQL" id="link-id0x12839fd0">SQL</a> client/<a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1288f350">SPARQL protocol</a>/all data formats.</li>
</ul>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1383">
  <rss:title>De Paradigmata and The Foundational Issues</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1383</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1383</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1383</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:21Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">De Paradigmata and The Foundational Issues I thought that we had talked ourselves to exhaustion and beyond over the issue of the semantic web layer cake. Apparently not. There was a paper called Functional Architecture for the Semantic Web by Aurona Gerber et al at ESWC2008. The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case. Any diagram containing the boxes of the layer cake (i.e., URI, XML, SPARQL, OWL, RIF, Crypto, etc., etc.) in whatever order or arrangement can at best be a sort of overview of how these standards reference each other. Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites for producing motion and secondly links to Newton&#39;s laws of motion and to aerodynamics. Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve. Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface. Such a thing cannot possibly be inferred from the standards&#39; inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low data volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts. We hear it over and over. Scale and speed are critical. 
Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure. The even-more foundational issue is the promotion of the general concept of a Web of Data. The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. Or it is better to say that the first picture shown should stress this idea and then one could leave the layer cake, in whatever version, for explaining the standards&#39; order of evolution or inter-reference. So, the value proposition: Why? Explosion of data volume, increased need of keeping up-to-date, increasing opportunity cost of not keeping in real time. What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale. How? URI everything and everything is cool, or, give things global names. Use RDF. Reuse names or ontologies where can. (An ontology is a set of classes and property names plus some more.) Map relational data on the fly or store as RDF, whichever works. Query with SPARQL, easier than SQL. So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">De Paradigmata and The Foundational Issues</div>
<p>I thought that we had talked ourselves to exhaustion and beyond over the issue of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1dd07c68">semantic web</a> layer cake. Apparently not. There was a paper called <i>Functional Architecture for the Semantic Web</i> by <a href="http://gerberaj.googlepages.com/" id="link-id106b8130">Aurona Gerber</a> et al. at <a href="http://www.eswc2008.org/" id="link-id0x17137300">ESWC2008</a>.</p>
<p>The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.</p>
<p>Any diagram containing the boxes of the layer cake (<a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x1a9138c0">URI</a>, <a href="http://dbpedia.org/resource/XML" id="link-id0x1cc4a8d8">XML</a>, <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xa21c1308">SPARQL</a>, <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x1aa28050">OWL</a>, <a href="http://dbpedia.org/resource/Rule_Interchange_Format" id="link-id0x137268d0">RIF</a>, Crypto, etc.), in whatever order or arrangement, can at best be a sort of overview of how these standards reference each other.</p>
<p>Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites for producing motion and secondly links to Newton&#39;s laws of motion and to aerodynamics.</p>
<p>Not false. But it does not say that a car is good for an economical commute, for showing off at the strip, or for any of the other niches that a mature industry has grown to serve.</p>
<p>Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.</p>
<p>Such a thing cannot possibly be inferred from the standards&#39; inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low <a href="http://dbpedia.org/resource/Data" id="link-id0x196fcba0">data</a> volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.</p>
<p>We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.</p>
<p>The even more foundational issue is the promotion of the general concept of a Web of Data.</p>
<p>The core idea, that the Web would be a queryable collection of data with meaningful references between data of different provenance, cannot be inferred from the picture, even though this should be its primary message. Or rather, the first picture shown should stress this idea, and one could then leave the layer cake, in whatever version, for explaining the standards&#39; order of evolution or inter-reference.</p>
<p>So, the value proposition:</p>
<p>Why? Explosion of data volume, a growing need to stay up to date, and a rising opportunity cost of not keeping up in real time.</p>
<p>What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.</p>
<p>How? URI everything and everything is cool, or, give things global names. Use <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x13700d00">RDF</a>. Reuse names or ontologies where you can. (An ontology is a set of class and property names, plus some more.) Map relational data on the fly or store it as RDF, whichever works. Query with SPARQL, which is easier than <a href="http://dbpedia.org/resource/SQL" id="link-id0x17865208">SQL</a>.</p>
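<p>To make the "give things global names" point concrete, here is a minimal sketch in Python. It is a toy, not how any RDF store works: triples are plain tuples, and every URI, property name, and value is invented for the example. It only shows that two sources which reuse the same name for a thing can be joined by a plain union, without a mapping step planned in advance.</p>
<pre>
```python
# Toy illustration only: triples are (subject, predicate, object) tuples.
# Every URI and value below is made up for the example.
NAME = "http://example.org/v/name"
CITY = "http://example.org/v/city"
POP = "http://example.org/v/population"

# Source 1: a hypothetical contact list.
PEOPLE = {
    ("http://example.org/id/alice", NAME, "Alice"),
    ("http://example.org/id/alice", CITY, "http://example.org/id/city1"),
}

# Source 2: a hypothetical gazetteer that reuses the same city URI.
PLACES = {
    ("http://example.org/id/city1", POP, 1000),
}


def objects(triples, subject, predicate):
    """Return all objects for a subject/predicate pair, sorted for stable output."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# Because both sources use the same global name for the city,
# a set union is all it takes to combine them.
merged = PEOPLE | PLACES


def city_population(person):
    """Follow person -> city -> population across the merged data."""
    result = []
    for city in objects(merged, person, CITY):
        result.extend(objects(merged, city, POP))
    return result
```
</pre>
<p>Calling <code>city_population("http://example.org/id/alice")</code> walks from the first source into the second through the shared city URI.</p>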
<p>So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1382">
  <rss:title>voiD, or Will the LOD Cloud Bring Rain?</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1382</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1382</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1382</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:20Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">voiD, or Will the LOD Cloud Bring Rain? At ESWC2008, we saw the Linked Open Data Cloud condense its first drops of precipitation. voiD, Vocabulary of Interlinked Datasets, is an idea whose time has clearly come. By the end of the conference, many speakers had already adopted the meme. The point is to describe what is inside the data sets. People may know this from having worked with the sets or from putting them together but to an outsider this is not evident. The Semantic Sitemap says where there are files or end points for access. But it does not say what is inside these. Also for federation, it is important to be able to determine whether it makes sense to send a particular query to a particular end point. If we play this right, this is what voiD will provide. I have to think of Dan Simmons&#39; flamboyant Hyperion sci-fi series where the &quot;void which binds&quot; was a sort of hyperspace containing the thoughts of entities, past and present and even provided teleportation. So what does the voiD hold, aside infinite potentialities? The obvious part is DC-like provenance, version, authorship, license and such data set wide information. Also the subject matter could be classified by reference to UMBEL or the Yago classification of DBpedia. More is needed, though. The simple part is listing the ontologies, if any. Also a set of namespaces would be an idea but this could be very large. So let us look at what we&#39;d like to be able to answer with the voiD set. The below could be a sample of voiD questions? What subjects are in the LOD cloud? Given this URI, what set in the LOD cloud can tell me more? This is divided into asking a text index like Sindice for the location, getting the namespace or data set and then querying voiD. What need I federate/load in order to combine all that is reachable from a given vocabulary? 
There could be for example a graph showing the data sets and edges between them, edges being qualified by a set of same as assertions, itself a voiD described set, if translations were needed. What sets are from the same or equally trusted publisher as this one? These things are roughly divided into description of the set and then some details on how it is stored on a given end point. Given this set, in which other sets will I find use of the same URIs? For example, if I have language version x, I wish to know that language version y will have the same URIs insofar the things meant are the same. Given this set, which sets of same as assertions will I have for mapping to which other sets? For example, if I have Geonames, I wish to know that set x will map at least some of the URIs in Geonames to DBpedia URIs. Let me further point out that it is increasingly clear to the community that universal sameAs is dubious, hence sameAs assertions ought to be kept separate and included or excluded depending on the usage context. Given this set, what are the interesting queries I can do? This is a sort of advertisement for human consumption. This is not a list of queries for crashing the end point. Denial of service can be done in SPARQL without knowing the end point content anyhow, so this is not an added risk exposer. Vocabularies used. This is a reference to the OWL or RDFS resources giving the applicable ontologies, if present. Also, a complete list of classes whose direct instances actually occur in the set is useful. Ballpark cardinality. Something like a DARQ optimization profile would be a good idea. I would say that there should be a possibility of just including a DARQ description file as is. This is a sort of baseline and since it already exist, we are spared the committee trouble of figuring out what it ought to contain and what not. If we start defining this from scratch, it will take long. Further, let this be optional. 
Quite Independently of this, query processors may make optimization related queries to remote end points insofar the specific end point supports these. This will come in time. For now, just the basics. Along with this, LOD SPARQL end points could adopt a couple of basic conventions. The simplest would be to agree that each would host a graph with a given URI that would contain the voiD descriptions of the data sets contained, along with the graph URI used for each set, if different from the publisher&#39;s URI for the graph. There is a point to this since an end point may load multiple data sets into one graph. We hope to have a good idea of the matter in a couple of weeks, certainly a general statement of direction to be published at Linked Data Planet in a couple of weeks.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">voiD, or Will the LOD Cloud Bring Rain?</div>
<p>At <a href="http://www.eswc2008.org/" id="link-id0x1c3bec48">ESWC2008</a>, we saw the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1f0db270">Linked Open Data</a> Cloud condense its first drops of precipitation.</p>
<p>
<a href="http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08" id="link-id106ee858">voiD, Vocabulary of Interlinked Datasets</a>, is an idea whose time has clearly come. By the end of the conference, many speakers had already adopted the <a href="http://dbpedia.org/resource/Meme" id="link-id0x16c99ad0">meme</a>.</p>
<p>The point is to describe what is inside the <a href="http://dbpedia.org/resource/Data" id="link-id0x1c540958">data</a> sets. People may know this from having worked with the sets or from putting them together but to an outsider this is not evident.</p>
<p>The Semantic Sitemap says where there are files or end points for access. But it does not say what is inside these. Also for federation, it is important to be able to determine whether it makes sense to send a particular query to a particular end point.</p>
<p>If we play this right, this is what voiD will provide. I am reminded of Dan Simmons&#39; flamboyant Hyperion sci-fi series, where the &quot;void which binds&quot; was a sort of hyperspace that contained the thoughts of entities, past and present, and even provided teleportation.</p>
<p>So what does the voiD hold, aside from infinite potentialities?</p>
<p>The obvious part is DC-like provenance, version, authorship, license, and similar data-set-wide <a href="http://dbpedia.org/resource/Information" id="link-id0x16c05280">information</a>. Also, the subject matter could be classified by reference to <a href="http://umbel.org/about/" id="link-id0x1abf1558">UMBEL</a> or the <a href="http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/" id="link-id0x1b49ee78">Yago</a> classification of <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x184dea28">DBpedia</a>.</p>
<p>More is needed, though. The simple part is listing the ontologies, if any. Also a set of namespaces would be an idea but this could be very large.</p>
<p>So let us look at what we&#39;d like to be able to answer with the voiD set.</p>
<p>The following could be a sample of voiD questions:</p>
<ul>
 <li>
  <p>
    <i>What subjects are in the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1bbac318">LOD</a> cloud?</i>
  </p>
 </li>
<li>
  <p>
    <i>Given this <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x1f74c7e8">URI</a>, what set in the LOD cloud can tell me more?</i> This is divided into asking a text index like <a href="http://sindice.org/" id="link-id0x1d57a8f8">Sindice</a> for the location, getting the namespace or data set and then querying voiD.</p>
</li>
<li>
  <p>
    <i>What do I need to federate or load in order to combine all that is reachable from a given vocabulary?</i> There could be, for example, a graph showing the data sets and the edges between them, with edges qualified by a set of sameAs assertions (itself a voiD-described set) if translations were needed.</p>
</li>
<li>
  <p>
    <i>What sets are from the same or equally trusted publisher as this one?</i>
  </p>
</li>
</ul>
<p>These things are roughly divided into a description of the set and then some details on how it is stored on a given end point.</p>
<ul>
 <li>
  <p>
    <i>Given this set, in which other sets will I find use of the same URIs?</i> For example, if I have language version x, I wish to know that language version y will have the same URIs insofar as the things meant are the same.</p>
 </li>
<li>
  <p>
    <i>Given this set, which sets of sameAs assertions will I have for mapping to which other sets?</i> For example, if I have <a href="http://www.geonames.org/" id="link-id0x1b372140">Geonames</a>, I wish to know that set x will map at least some of the URIs in Geonames to DBpedia URIs.</p>
</li>
</ul>
<p>Let me further point out that it is increasingly clear to the community that universal sameAs is dubious, hence sameAs assertions ought to be kept separate and included or excluded depending on the usage context.</p>
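<p>As a minimal sketch of keeping sameAs assertions separate, the Python below holds each set of assertions under its own name and applies only the sets a given usage context has enabled. The set names, the prefixed identifiers, and the <code>resolve</code> helper are all invented for illustration; nothing here is part of voiD or of any existing tool.</p>
<pre>
```python
# Hypothetical sameAs sets, each published and described separately.
# All set names and identifiers are invented for this example.
SAMEAS_SETS = {
    "geo-to-encyclopedia": {"geo:123": "enc:Paris"},
    "loose-matches": {"geo:123": "enc:Paris_Texas"},
}


def resolve(uri, enabled_sets):
    """Map a URI through only the sameAs sets enabled for this usage context."""
    targets = set()
    for name in enabled_sets:
        mapping = SAMEAS_SETS[name]
        if uri in mapping:
            targets.add(mapping[uri])
    # With no applicable assertion, the URI simply stands for itself.
    return sorted(targets) or [uri]
```
</pre>
<p>A strict application would enable only <code>"geo-to-encyclopedia"</code>, while a search-style application might also enable <code>"loose-matches"</code> and accept the extra, possibly wrong, target.</p>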
<ul>
 <li>
  <p>
    <i>Given this set, what are the interesting queries I can do?</i> This is a sort of advertisement for human consumption. This is not a list of queries for crashing the end point. Denial of service can be done in <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1b25dea8">SPARQL</a> without knowing the end point content anyhow, so this adds no new risk exposure.</p>
 </li>
<li>
  <p>
    <i>Vocabularies used.</i> This is a reference to the OWL or RDFS resources giving the applicable ontologies, if present. Also, a complete list of classes whose direct instances actually occur in the set is useful.</p>
</li>
<li>
  <p>
    <i>Ballpark cardinality.</i> Something like a <a href="http://darq.sourceforge.net/" id="link-id0x1ed8f580">DARQ</a> optimization profile would be a good idea. I would say that there should be a possibility of just including a DARQ description file as is. This is a sort of baseline, and since it already exists, we are spared the committee trouble of figuring out what it ought to contain and what not. If we start defining this from scratch, it will take a long time. Further, let this be optional. Quite independently of this, query processors may make optimization-related queries to remote end points insofar as the specific end point supports these. This will come in time. For now, just the basics.</p>
</li>
</ul>
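<p>The last two items, vocabularies used and ballpark cardinality, are what a federating query processor would consult first. The Python below is a rough sketch of that idea under loud assumptions: the description format is a made-up dictionary (it is not DARQ's description file format), the end point URIs and predicate names are invented, and using raw triple counts as a stand-in for selectivity is a simplification.</p>
<pre>
```python
# Hypothetical data set descriptions: for each end point, the predicates it
# uses and a ballpark triple count per predicate. All values are invented.
DESCRIPTIONS = {
    "http://example.org/sparql/a": {"ex:name": 1000000, "ex:livesIn": 900000},
    "http://example.org/sparql/b": {"ex:population": 50000},
}


def eligible_endpoints(predicate):
    """End points whose description says they hold triples with this predicate."""
    return sorted(ep for ep, preds in DESCRIPTIONS.items() if predicate in preds)


def plan(predicates):
    """Order predicates by ascending ballpark cardinality, smallest first,
    and attach the end points that are worth asking for each one."""

    def total(pred):
        return sum(preds.get(pred, 0) for preds in DESCRIPTIONS.values())

    return [(p, eligible_endpoints(p)) for p in sorted(predicates, key=total)]
```
</pre>
<p>With these made-up numbers, <code>plan(["ex:name", "ex:population"])</code> puts the population pattern first and sends it only to end point <code>b</code>; a predicate that no description mentions gets an empty list and need not be sent anywhere.</p>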
<p>Along with this, LOD SPARQL end points could adopt a couple of basic conventions. The simplest would be to agree that each would host a graph with a given URI that would contain the voiD descriptions of the data sets contained, along with the graph URI used for each set, if different from the publisher&#39;s URI for the graph. There is a point to this since an end point may load multiple data sets into one graph.</p>
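<p>A minimal sketch of this convention, in Python, might look as follows. The agreed graph URI, the <code>ex:</code> property names, and the data set URIs are placeholders invented for the example, and treating a data set's own URI as the publisher's graph URI is a simplification.</p>
<pre>
```python
# Assumption: the agreed-upon graph URI and the property names below are
# placeholders; no such convention has actually been fixed.
VOID_GRAPH = "http://example.org/void-descriptions"

# A toy end point: graph URI mapped to a set of (subject, predicate, object).
ENDPOINT = {
    VOID_GRAPH: {
        ("http://example.org/set/geo", "ex:title", "Geo data"),
        ("http://example.org/set/people", "ex:title", "People data"),
        # Both sets were loaded into one shared graph, so the graph URI
        # differs from the publisher's own URI and is stated explicitly.
        ("http://example.org/set/geo", "ex:storedInGraph", "http://example.org/graph/all"),
        ("http://example.org/set/people", "ex:storedInGraph", "http://example.org/graph/all"),
    },
}


def graph_for(dataset):
    """Graph URI a data set is stored in; defaults to the data set's own URI
    when the description names no other graph."""
    for s, p, o in ENDPOINT[VOID_GRAPH]:
        if s == dataset and p == "ex:storedInGraph":
            return o
    return dataset
```
</pre>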
<p>We hope to have a good idea of the matter soon; certainly a general statement of direction will be published at <a href="http://www.linkeddataplanet.com/" id="link-id0x1b049830">Linked Data Planet</a> in a couple of weeks.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1381">
  <rss:title>The DARQ Matter of Federation</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1381</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1381</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1381</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The DARQ Matter of Federation Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space. For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace. To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out. DARQ is Bastian Quilitz&#39;s federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly. Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though. Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads. Since we are convinced of the cause, let&#39;s talk about the means. For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. 
If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server. Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested. These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day. When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist: If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus {?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h } does not necessarily denote a colocated join if many servers serve items of the vocabulary. However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. 
If nothing is found, then we can partition each join step separately for the unlikely case that there is a server that gives geek codes but not names. For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these. For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually. We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see. Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well-described SPARQL end-points. But up to a certain volume it has the speed of local storage. The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is a separate post about this.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">The DARQ Matter of Federation</div>
<p>Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space.</p>
<p>For the <a href="http://dbpedia.org/resource/Data" id="link-id0x19dbf410">data</a> web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in <a href="http://dbpedia.org/resource/Cyberspace" id="link-id0x9fc13ff8">cyberspace</a>.</p>
<p>To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.</p>
<p>
  <a href="http://darq.sourceforge.net/" id="link-id0x1d06bd88">DARQ</a> is Bastian Quilitz&#39;s federated extension of the <a href="http://jena.sourceforge.net/" id="link-id0x1cf28f70">Jena</a> <a href="http://jena.sourceforge.net/ARQ/" id="link-id0x1cba22c8">ARQ</a> <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x171c7dc8">SPARQL</a> processor. It has existed for a while and was also presented at <a href="http://www.eswc2008.org/" id="link-id0x1ed53cd0">ESWC2008</a>. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.</p>
<p>Bastian had split <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1ce846c0">DBpedia</a> among five <a href="http://virtuoso.openlinksw.com" id="link-id0x1cad0640">Virtuoso</a> servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.</p>
<p>Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.</p>
<p>Since we are convinced of the cause, let&#39;s talk about the means.</p>
<p>For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x19a48280">HTTP</a>/1.1 message. So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server.</p>
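<p>To make the batching idea concrete, here is a minimal Python sketch. It is not DARQ code: the function names, the ex: prefixed names and the endpoint URLs are made up for illustration, and the VALUES clause used to carry all bindings in one query comes from SPARQL 1.1, which postdates this post. The sketch only counts the messages each strategy would send.</p>

```python
def plan_requests(friends, endpoints, batched):
    """Return the (endpoint, bindings) messages a federator would send.

    Hypothetical helper for illustration; no network traffic happens here.
    """
    if not batched:
        # One round trip per binding per eligible endpoint.
        return [(ep, [f]) for ep in endpoints for f in friends]
    # One round trip per eligible endpoint, carrying every binding.
    return [(ep, list(friends)) for ep in endpoints]


def values_fragment(friends):
    """Build the inner join step once for all bindings.

    Assumption: the remote end point accepts a SPARQL 1.1 VALUES clause
    and already knows the made-up ex: prefix.
    """
    rows = " ".join(friends)
    return "SELECT ?x ?city WHERE { VALUES ?x { %s } ?x ex:livesIn ?city }" % rows


friends = ["ex:friend%d" % i for i in range(100)]
endpoints = ["http://server-a.example/sparql", "http://server-b.example/sparql"]

unbatched = plan_requests(friends, endpoints, batched=False)
batched = plan_requests(friends, endpoints, batched=True)
print(len(unbatched), len(batched))
```

<p>With 100 friends and two eligible servers this prints 200 2, so the latency cost drops from one round trip per friend to one round trip per server.</p>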
<p>Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1cf18278">JDBC</a> so Bastian can try this if interested.</p>
<p>These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.</p>
<p>When federating <a href="http://dbpedia.org/resource/SQL" id="link-id0x1cf7d0e8">SQL</a>, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:</p>
<p>If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus <code>{?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h }</code> does not necessarily denote a colocated join if many servers serve items of the vocabulary.</p>
<p>However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total, recall. For search-style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there is a server that gives geek codes but not names.</p>
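<p>The colocated-first, partition-on-miss strategy can be sketched as follows. This is only an illustration under stated assumptions: each server is modelled as an in-memory Python dictionary from subject to properties, and the function names and sample data are invented here, not taken from DARQ or Virtuoso.</p>

```python
def colocated(servers, name):
    """Send the whole three-pattern fragment to each server on its own."""
    hits = []
    for triples in servers:
        for s, props in triples.items():
            if (props.get("name") == name
                    and "geekCode" in props and "emailHash" in props):
                hits.append((s, props["geekCode"], props["emailHash"]))
    return hits


def partitioned(servers, name):
    """Fallback: evaluate each join step on every server, join centrally."""
    def lookup(pred, subject=None):
        return [(s, props[pred])
                for triples in servers
                for s, props in triples.items()
                if pred in props and (subject is None or s == subject)]

    hits = []
    for s, n in lookup("name"):
        if n != name:
            continue
        for _, g in lookup("geekCode", s):
            for _, h in lookup("emailHash", s):
                hits.append((s, g, h))
    return hits


def federated(servers, name):
    """Assume colocation first; partition the join only if nothing is found."""
    return colocated(servers, name) or partitioned(servers, name)


# Made-up data: the name and email hash live on one server,
# the geek code on another.
server_a = {"ex:johnny": {"name": "Johnny", "emailHash": "5f2a"}}
server_b = {"ex:johnny": {"geekCode": "GCS"}}
split_result = federated([server_a, server_b], "Johnny")
print(split_result)
```

<p>Here the colocated attempt finds nothing, so the fallback joins the three patterns across both servers and returns the single match.</p>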
<p>For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.</p>
<p>For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.</p>
<p>We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see.</p>
<p>Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1e163140">Sponger</a>. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well-described SPARQL end-points. But up to a certain volume it has the speed of local storage.</p>
<p>The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id1109a4c8">a separate post</a> about this.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1380">
  <rss:title>Aspects of RDF to RDF Mapping</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1380</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1380</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1380</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:18Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Aspects of RDF to RDF Mapping The W3C has recently launched an incubator group about mapping relational data to RDF. From participating in the group for the few initial sessions, I get the following impressions. There is a segment of users, for example from the biomedical community, who do heavy duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest. There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as Triplify and semantic web search such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all. (Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.) Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex mashups against the whole of the RDF-ized web. As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume. Where do we stand? We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing. We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist. These things are somewhat different. 
At present, both the research warehousers and the web scalers predominantly go for ETL. This is fine by us as we definitely are in the large RDF store race. Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in SPARQL, once the obviously necessary extensions of sub-query, expressions and aggregation are in place. To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. &quot;Such a big server farm that we&#39;d have to put it on our map,&quot; as Jenny Harding put it at ESWC2008. I&#39;d add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets. So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query. You can download the slide set I used for the Virtuoso presentation for the RDB to RDF mapping incubator group (PPT; other formats coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the information architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. 
Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble but non-trivial question. We will show the basics of this with the TPC-H mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings from keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (PDF; other formats coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing SQL that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an IN sub-query instead of the equivalent existence test. Follow this blog for more on the topic; published papers are always a limited view on the matter.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Aspects of RDF to RDF Mapping</div>
<p>The W3C has recently launched an <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-idd763f48">incubator group about mapping relational data to RDF</a>.</p>
<p>From participating in the group for the few initial sessions, I get the following impressions.</p>
<p>There is a segment of users, for example from the biomedical community, who do heavy duty <a href="http://dbpedia.org/resource/Data" id="link-id0x17f9e6f8">data</a> integration and look to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x17eabf48">RDF</a> for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest.</p>
<p>There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as <a href="http://triplify.org/" id="link-id0x1a25cd28">Triplify</a> and <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x182c41e8">semantic web</a> search such as <a href="http://sindice.org/" id="link-id0x1a29c5e8">Sindice</a>. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.</p>
<p>(Some cleanup is taking place with <a href="http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/" id="link-id0x17fd2b70">Yago</a> and <a href="http://zitgist.com/about/" id="link-id0x17e6ab88">Zitgist</a>, but this is a matter for a different post.)</p>
<p>Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex mashups against the whole of the RDF-ized web.</p>
<p>As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.</p>
<p>Where do we stand?</p>
<p>We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x182acd68">RDBMS</a>(s) on demand. Projects in this direction are ongoing.</p>
<p>We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.</p>
<p>These things are somewhat different.</p>
<p>At present, both the research warehousers and the web scalers predominantly go for ETL.</p>
<p>This is fine by us as we definitely are in the large RDF store race.</p>
<p>Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x16c91438">SPARQL</a>, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.</p>
<p>To cite an example, the Ordnance Survey of the UK has a GIS system running on <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x17ee37c8">Oracle</a> with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. &quot;Such a big server farm that we&#39;d have to put it on our map,&quot; as Jenny Harding put it at <a href="http://www.eswc2008.org/" id="link-id0x1cab6330">ESWC2008</a>. I&#39;d add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.</p>
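<p>The PC count is simple division. A quick check in Python, using only the two figures quoted above (the variable names are ours):</p>

```python
total_triples = 10**15        # one petatriple, the Ordnance Survey estimate
triples_per_pc = 10 * 10**9   # 10 gigatriples per PC, as counted above

pcs_needed = total_triples // triples_per_pc
print(pcs_needed)  # 100000
```

<p>Any lower, more comfortable triples-per-PC figure would push the machine count higher still.</p>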
<p>So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query.</p>
<p>You can download the slide set I used for the <a href="http://virtuoso.openlinksw.com" id="link-id0xa1fb7e8">Virtuoso</a> presentation for the RDB to RDF mapping incubator group (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt" id="link-id106f9e88">PPT</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id10a8dc90">other formats</a> coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the <a href="http://dbpedia.org/resource/Information" id="link-id0x17ee38a0">information</a> architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble but non-trivial question.</p>
<p>We will show the basics of this with the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1844d718">TPC-H</a> mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings from keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/RDFAndMapped_BI.pdf" id="link-id12ea4b18">PDF</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id106e1810">other formats</a> coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing <a href="http://dbpedia.org/resource/SQL" id="link-id0x17fc8d60">SQL</a> that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an <code>IN</code> sub-query instead of the equivalent existence test.</p>
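<p>For readers who want to see the two SQL shapes side by side, here is a small sketch. It uses Python's built-in sqlite3 module rather than Oracle, so it says nothing about Oracle's optimizer; it only shows that the IN sub-query and the existence test return the same rows. The two tables are a made-up miniature that borrows TPC-H column names.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Form 1: IN sub-query.
in_form = """
    SELECT c_name FROM customer
    WHERE c_custkey IN (SELECT o_custkey FROM orders)
    ORDER BY c_name
"""

# Form 2: the equivalent existence test.
exists_form = """
    SELECT c_name FROM customer
    WHERE EXISTS (SELECT 1 FROM orders WHERE o_custkey = c_custkey)
    ORDER BY c_name
"""

in_rows = conn.execute(in_form).fetchall()
exists_rows = conn.execute(exists_form).fetchall()
print(in_rows == exists_rows, in_rows)
```

<p>Both forms list the customers that have at least one order. A mapping layer that generates SQL is free to pick whichever form the target RDBMS handles best.</p>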
<p>Follow this <a href="http://dbpedia.org/resource/Blog" id="link-id0xa9bcef8">blog</a> for more on the topic; published papers are always a limited view on the matter.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1379">
  <rss:title>ESWC 2008</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1379</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1379</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1379</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:16Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">ESWC 2008 Yrjänä Rankka and I attended ESWC2008 on behalf of OpenLink. We were invited at the last minute to give a Linked Open Data talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of SPARQL BI (PPT; other formats coming soon), our business intelligence extensions to SPARQL as well as joining between relational data mapped to RDF and native RDF data. I also spoke at the social networks panel chaired by Harry Halpin. I have gathered a few impressions that I will share in the next few posts (1 - RDF Mapping, 2 - DARQ, 3 - voiD, 4 - Paradigmata). Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant. The voiD Graph Linked Open Data was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for LOD this time around is voiD, the Vocabulary of Interlinked Datasets, a means to describe what in fact is inside the LOD cloud, how to join it with what and so forth. Big time important if there is to be a web of federatable data sources, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of Linked Data Planet in a couple of weeks. Federating Bastian Quilitz gave a talk about his DARQ, a federated version of Jena&#39;s ARQ. Something like DARQ&#39;s optimization statistics should make their way into the SPARQL protocol as well as the voiD data set description. We really need federation but more on this in a separate post. XSPARQL Axel Polleres et al. had a paper about XSPARQL, a merge of XQuery and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. 
It is evident that the engines must be in the same process and not communicate via the SPARQL protocol for this to be practical. We could do this. We&#39;ll have to see when. Politically, using XQuery to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow. Panel: Will the Sem Web Rise to the Challenge of the Social Web? The social web panel presented the question of whether the sem web was ready for prime time with data portability. The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our semantic web, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot; So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as OAuth and OpenID. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything. I said that triples and vocabularies existed, that OpenLink had ODS (OpenLink Data Spaces, Community LinkedData) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. 
Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time. The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data. So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations! As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, information overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services. Identity and Reference For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by nominibus suis, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources. For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. 
So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application. Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an open world. For an app-specific warehouse, sameAs can be resolved at load time. There was naturally some apparent tension between the Occam camp of entity name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful. CYC and NLP The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of knowledge representation. Over the long term, I would have support of the CYC inference language inside a database query processor. This would mostly be for repurposing the huge knowledge base for helping in search type queries. If it is for transactions or financial reporting, then queries will be SQL and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see. I will also have a closer look at natural language one of these days, quite inevitably, since Zitgist (for example) is into entity disambiguation. Scale Garlic gave a talk about their Data Patrol and QDOS. 
We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive. Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and Sindice/Zitgist as triples. Conclusions There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then. For the Virtuoso roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At Linked Data Planet, we will make more announcements.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">ESWC 2008</div>
<p>Yrjänä Rankka and I attended <a href="http://www.eswc2008.org/" id="link-id10b7a038">ESWC2008</a> on behalf of OpenLink.</p>
<p>We were invited at the last minute to give a <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id105df758">Linked Open Data</a> talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of <a href="http://dbpedia.org/resource/SPARQL" id="link-id12eacca0">SPARQL</a> BI (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/ESWC2008%20SPARQL%20BI%20OpenLink.ppt" id="link-id10b43e58">PPT</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id1116d8f0">other formats coming soon</a>), our business intelligence extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x16c9bfc8">SPARQL</a> as well as joining between relational <a href="http://dbpedia.org/resource/Data" id="link-id10badc40">data</a> mapped to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id108edaf8">RDF</a> and native <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x181a5ed8">RDF</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x17e69910">data</a>. I also spoke at the social networks panel chaired by Harry Halpin.</p>
<p>I have gathered a few impressions that I will share in the next few posts (<a href="http://www.openlinksw.com/weblog/oerling/?id=1375" id="link-id107298e0">1 - RDF Mapping</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id10b3a530">2 - DARQ</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id107290e0">3 - voiD</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1378" id="link-id1071a950">4 - Paradigmata</a>). <i>Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.</i>
</p>
<h2>The voiD Graph</h2>
<p>
  <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1a87f110">Linked Open Data</a> was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id108f3c48">LOD</a> this time around is <a href="http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08" id="link-id10df9830">voiD, the Vocabulary of Interlinked Datasets</a>, a means to describe what in fact is inside the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1a089980">LOD</a> cloud, how to join it with what and so forth. Big time important if there is to be a <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-iddf74578">web of federatable data sources</a>, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of <a href="http://www.linkeddataplanet.com/" id="link-id10dd0848">Linked Data Planet</a> in a couple of weeks.</p>
<h2>Federating</h2>
<p>Bastian Quilitz gave a talk about his <a href="http://darq.sourceforge.net/" id="link-id108746e8">DARQ</a>, a federated version of Jena&#39;s ARQ.</p>
<p>Something like <a href="http://darq.sourceforge.net/" id="link-id0x1a2d9860">DARQ</a>&#39;s optimization statistics should make their way into the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id10992348">SPARQL protocol</a> as well as the voiD data set description.</p>
<p>We really need federation but more on this in <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id1059d688">a separate post</a>.</p>
<h2>
<a href="http://xsparql.deri.ie/" id="link-id10314308">XSPARQL</a>
</h2>
<p>Axel Polleres et al. had a paper about <a href="http://xsparql.deri.ie/" id="link-id0x1ad77490">XSPARQL</a>, a merge of <a href="http://dbpedia.org/resource/XQuery" id="link-id10b98e90">XQuery</a> and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x17e75190">SPARQL protocol</a> for this to be practical. We could do this. We&#39;ll have to see when.</p>
<p>Politically, using <a href="http://dbpedia.org/resource/XQuery" id="link-id0x18a9bf10">XQuery</a> to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries, but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.</p>
<h2>Panel: Will the Sem Web Rise to the Challenge of the Social Web?</h2>
<p>The social web panel presented the question of whether the sem web was ready for prime time with data portability.</p>
<p>The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id122f4da0">semantic web</a>, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot;</p>
<p>So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as <a href="http://dbpedia.org/page/OAuth" id="link-idf300bc0">OAuth</a> and <a href="http://dbpedia.org/page/OpenID" id="link-id10ce7a40">OpenID</a>. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3-year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course, there is an impedance mismatch between specialization and interfacing with everything.</p>
<p>I said that triples and vocabularies existed, that OpenLink had <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id1210dbf8">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id11076be8">OpenLink Data Spaces</a>, <a href="http://community.linkeddata.org/" id="link-id10d46710">Community LinkedData</a>) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. Count 16G of RAM and a few cores per Gtriple so that one is not waiting for disk all the time.</p>
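<p>As a rough check of these figures, the arithmetic can be sketched as below. This is a back-of-envelope calculation only: it assumes that "16G" means 16 GB of RAM per Gtriple, and the variable names are illustrative rather than part of any OpenLink tooling. The dollars-per-GB figure is the whole hardware budget spread over the RAM, not the price of memory alone.</p>
<blockquote>
<pre>
```python
# Back-of-envelope sizing from the figures quoted above.
# Assumption: "16G" means 16 GB of RAM per gigatriple.
GTRIPLES = 100            # target scale, in gigatriples
RAM_GB_PER_GTRIPLE = 16   # RAM budget per gigatriple
BUDGET_USD = 400_000      # hardware estimate, without replication

total_ram_gb = GTRIPLES * RAM_GB_PER_GTRIPLE
usd_per_gtriple = BUDGET_USD / GTRIPLES
usd_per_ram_gb = BUDGET_USD / total_ram_gb

print(f"total RAM: {total_ram_gb} GB")
print(f"budget per Gtriple: ${usd_per_gtriple:,.0f}")
print(f"budget per GB of RAM: ${usd_per_ram_gb:,.0f}")
```
</pre>
</blockquote>
<p>Under these assumptions the cluster needs about 1.6 TB of aggregate RAM, which is why the hosting takes many PCs rather than one large box.</p>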
<p>The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.</p>
<p>So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!</p>
<p>As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, <a href="http://dbpedia.org/resource/Information" id="link-id108b7688">information</a> overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.</p>
<h2>Identity and Reference</h2>
<p>For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by <i>nominibus suis</i>, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.</p>
<p>For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.</p>
<p>Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id10c4dfd0">open world</a>. For an app-specific warehouse, sameAs can be resolved at load time.</p>
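<p>The idea of applying a chosen set of sameAs assertions at query time, instead of materializing them at load time, can be sketched as follows. This is a minimal illustration and not how Virtuoso implements it: the triples, the <code>ex:</code> names and both sameAs sets are hypothetical, echoing the cooking versus chemistry example above.</p>
<blockquote>
<pre>
```python
# Hypothetical data: two triples and two application-specific sameAs sets.
TRIPLES = [
    ("ex:NaCl", "ex:meltingPoint", "801 C"),
    ("ex:table_salt", "ex:usedIn", "ex:soup"),
]
COOKING_SAMEAS = [("ex:table_salt", "ex:NaCl")]
CHEMISTRY_SAMEAS = []  # a chemist does not equate table salt with pure NaCl


def closure(start, same_as):
    """All names reachable from `start` through the chosen sameAs pairs."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for a, b in same_as:
            # sameAs is symmetric, so follow each pair in both directions.
            for src, dst in ((a, b), (b, a)):
                if src == node and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return seen


def describe(subject, same_as):
    """Properties of `subject` under one sameAs set, resolved at query time."""
    names = closure(subject, same_as)
    return sorted((p, o) for s, p, o in TRIPLES if s in names)


print(describe("ex:table_salt", COOKING_SAMEAS))
print(describe("ex:table_salt", CHEMISTRY_SAMEAS))
```
</pre>
</blockquote>
<p>The stored triples never change; only the sameAs set passed with the query does, so a spam or malicious assertion can be dropped without reloading anything.</p>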
<p>There was naturally some apparent tension between the Occam camp of <a href="http://dbpedia.org/resource/Entity" id="link-id105fd240">entity</a> name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.</p>
<h2>CYC and NLP</h2>
<p>The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of <a href="http://dbpedia.org/resource/Knowledge" id="link-id10568158">knowledge</a> representation. Over the long term, I would like to see support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1acff9d0">knowledge</a> base to help with search-type queries. If it is for transactions or financial reporting, then queries will be <a href="http://dbpedia.org/resource/SQL" id="link-id130a0a80">SQL</a> and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see.</p>
<p>I will also have a closer look at natural language one of these days, quite inevitably, since <a href="http://zitgist.com/about/" id="link-id10795828">Zitgist</a> (for example) is into <a href="http://dbpedia.org/resource/Entity" id="link-id0x18a12918">entity</a> disambiguation.</p>
<h2>Scale</h2>
<p>Garlic gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.</p>
<p>Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and <a href="http://sindice.org/" id="link-id0x17f18a38">Sindice</a>/<a href="http://zitgist.com/about/" id="link-id0x184e9e90">Zitgist</a> as triples.</p>
<h2>Conclusions</h2>
<p>There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.</p>
<p>For the <a href="http://virtuoso.openlinksw.com" id="link-id1026ac28">Virtuoso</a> roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At <a href="http://www.linkeddataplanet.com/" id="link-id0x17ff5c20">Linked Data Planet</a>, we will make more announcements.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1369">
  <rss:title>Virtuoso Cluster Paper</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1369</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1369</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1369</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-05-30T10:02:04Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Cluster Paper We have a new article on Virtuoso cluster, submitted to ISWC 2008. Right now we are working on hosting the billion triples challenge data set at Amazon EC2 using Virtuoso Cluster. This will be the first publicly available instance of Virtuoso Cluster and all interested may then instantiate their own copy on the EC2 infrastructure. Towards Web Scale RDF Integrating Open Sources and Relational Data with SPARQL Business Intelligence Extensions for SPARQL Look for a separate announcement in the near future.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso Cluster Paper</div>
<div> <div>We have a new article on <a href="http://virtuoso.openlinksw.com" id="link-id10424890">Virtuoso</a> cluster, submitted to ISWC 2008.</div> <div>  Right now we are working on hosting the billion triples challenge <a href="http://dbpedia.org/resource/Data" id="link-id1077f800">data</a> set at Amazon EC2 using <a href="http://virtuoso.openlinksw.com" id="link-id102117f0">Virtuoso</a> Cluster.  This will be the first publicly available instance of <a href="http://virtuoso.openlinksw.com" id="link-id0x20387e80">Virtuoso</a> Cluster and all interested may then instantiate their own copy on the EC2 infrastructure. </div> <br /> <div>  <a href="http://www.openlinksw.com/weblog/oerling/2008iswc_webscale_rdf.pdf" id="link-id10af2f30">Towards Web Scale RDF</a>   <br /> <a href="http://www.openlinksw.com/weblog/oerling/RDFAndMapped_BI.pdf" id="link-idfedf9f0">Integrating Open Sources and Relational Data  with SPARQL</a>   <br /> <a href="http://www.openlinksw.com/weblog/oerling/bisparql2.pdf" id="link-id106e5418">Business Intelligence Extensions for SPARQL</a>   <br /> </div> <br /> <div>  Look for a separate announcement in the near future. </div>  </div>

</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1359">
  <rss:title>DBpedia Benchmark Revisited</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1359</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1359</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1359</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-05-09T19:33:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">DBpedia Benchmark Revisited We ran the DBpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries. Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box. We ran the queries in two variants: With graph specified in the SPARQL FROM clause, using the default indices. With no graph specified anywhere, using an alternate indexing scheme. The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times. First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient. So we added the index: create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);   Virtuoso v5 with gspo, ogps, pogs Virtuoso Cluster v6 with gspo, ogps Virtuoso Cluster v6 with gspo, ogps, pogs cold 210 s 136 s 33.4 s warm 0.600 s 4.01 s 0.628 s OK, so now let us do it without a graph being specified. 
For all platforms, we drop any existing indices, and -- create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)); alter index R2 on R2 partition (s int (0hexffff00)); log_enable (2); insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; drop table rdf_quad; alter table r2 rename RDF_QUAD; create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff)); The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient. With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia. If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.   Virtuoso v5 with gspo, ogps, pogs Virtuoso Cluster v6 with spog, pogs, opgs, gpos warm 0.595 s 0.617 s The cold times were about the same as above, so not reproduced. Graph or No Graph? It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default. On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. 
We have two distinct use cases which warrant different setups of the database, simple as that. The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above. Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some &quot;super-graph&quot; object, a graph of all graphs. I don&#39;t think this is directly supported and besides most apps would not need it. Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given. Conclusions The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat. Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content. This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long. 
The moral of the story is to have a lot of RAM and space-efficient data representation. The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic. Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative. Appendix: Query Text For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM &lt;http://dbpedia.org&gt; from each query. The returned row counts are indicated below each query&#39;s text. sparql SELECT ?p ?o FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/Metropolitan_Museum_of_Art&gt; ?p ?o }; -- 1337 rows sparql PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?film1 ?actor1 ?film2 ?actor2 FROM &lt;http://dbpedia.org&gt; WHERE { ?film1 p:starring &lt;http://dbpedia.org/resource/Kevin_Bacon&gt; . ?film1 p:starring ?actor1 . ?film2 p:starring ?actor1 . ?film2 p:starring ?actor2 . }; -- 23910 rows sparql PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?artist ?artwork ?museum ?director FROM &lt;http://dbpedia.org&gt; WHERE { ?artwork p:artist ?artist . ?artwork p:museum ?museum . ?museum p:director ?director }; -- 303 rows sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt; PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; SELECT ?s ?homepage FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/Berlin&gt; geo:lat ?berlinLat . &lt;http://dbpedia.org/resource/Berlin&gt; geo:long ?berlinLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s foaf:homepage ?homepage . 
FILTER ( ?lat &lt;= ?berlinLat + 0.03190235436 &amp;&amp; ?long &gt;= ?berlinLong - 0.08679199218 &amp;&amp; ?lat &gt;= ?berlinLat - 0.03190235436 &amp;&amp; ?long &lt;= ?berlinLong + 0.08679199218) }; -- 56 rows sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt; PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?s ?a ?homepage FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/New_York_City&gt; geo:lat ?nyLat . &lt;http://dbpedia.org/resource/New_York_City&gt; geo:long ?nyLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s p:architect ?a . ?a foaf:homepage ?homepage . FILTER ( ?lat &lt;= ?nyLat + 0.3190235436 &amp;&amp; ?long &gt;= ?nyLong - 0.8679199218 &amp;&amp; ?lat &gt;= ?nyLat - 0.3190235436 &amp;&amp; ?long &lt;= ?nyLong + 0.8679199218) }; -- 13 rows</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">DBpedia Benchmark Revisited</div>
<p>We ran the <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1cd6d0c8">DBpedia</a> benchmark queries again with different
configurations of <a href="http://virtuoso.openlinksw.com" id="link-id0x1bf01048">Virtuoso</a>. I had not studied the details of the
matter previously but now did have a closer look at the
queries.</p>
<p>Comparing numbers given by different parties is a constant
problem. In the case reported here, we loaded the full DBpedia 3,
all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6,
all on the same 4 core 2GHz Xeon with 8G RAM. All databases were
striped on 6 disks. The Cluster configuration was with 4 processes
in the same box.</p>
<p>We ran the queries in two variants:</p> 
<ul>
<li>With graph
specified in the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1b9d3ca0">SPARQL</a> <code>FROM</code> clause, using the default indices.</li>
<li>With no graph specified anywhere, using an
alternate indexing scheme.</li>
</ul>
<p>The times below are for the sequence of 5 queries; individual
query times are not reported. I did not do a line-by-line review of
the execution plans since they seem to run well enough. We could
get some extra mileage from cost model tweaks, especially for the
numeric range conditions, but we will do this when somebody comes up
with better times.</p>
<p>First, about Virtuoso v5: Because there is a query in the set that
specifies no condition on S or O and only P, this simply cannot be
done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is
more space efficient.</p>
<p>So we added the index:</p>
<blockquote>
<code>
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
</code>
</blockquote>

<table>
 <tr>
  <td> </td>
  <td align="center"><b>Virtuoso v5 with<br /> gspo, ogps, pogs</b>
  </td>
  <td align="center"><b>Virtuoso Cluster v6 with <br />gspo, ogps</b>
  </td>
  <td align="center"><b>Virtuoso Cluster v6 with <br />gspo, ogps, pogs</b>
  </td>
 </tr>
<tr>
  <td><b>cold</b>
  </td>
  <td align="center">210 s</td>
  <td align="center">136 s</td>
  <td align="center">33.4 s</td>
</tr>
<tr>
  <td><b>warm</b>
  </td>
  <td align="center">0.600 s</td>
  <td align="center">4.01 s</td>
  <td align="center">0.628 s</td>
</tr>
</table>

<p>OK, so now let us do it without a graph being specified. For
all platforms, we drop any existing indices, and --</p>
<blockquote>
<code>
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)); <br />
alter index R2 on R2 partition (s int (0hexffff00)); <br />
 <br />
log_enable (2); <br />
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; <br />
 <br />
drop table rdf_quad; <br />
alter table r2 rename RDF_QUAD; <br />
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); <br />
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); <br />
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
</code>
</blockquote>
<p>The code is identical for v5 and v6, except that with v5 we use
<code>iri_id (32 bit)</code> for the type, not <code>iri_id_8 (64 bit)</code>. We note that
we run out of IDs with v5 around a few billion triples, so with v6
we have double the ID length and still manage to be vastly more
space efficient.</p>
<p>With the above 4 indices, we can query the <a href="http://dbpedia.org/resource/Data" id="link-id0x1bae4cd8">data</a> pretty much in
any combination without hitting a full scan of any index. We note
that all indices that do not begin with s end with s as a bitmap.
This takes about 60% of the space of a non-bitmap index for data such
as DBpedia.</p>
<p>If you intend to do completely arbitrary RDF queries in
Virtuoso, then chances are you are best off with the above index
scheme.</p>
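<p>The claim that these four key orders cover every combination of bound quad columns can be checked with a small sketch. It only counts how many leading key parts of each index are bound; it is a simplification for illustration, not Virtuoso&#39;s cost model, and the helper names are made up for this example.</p>
<blockquote>
<pre>
```python
from itertools import combinations

# Key orders of the four indices in the scheme above.
INDICES = ["spog", "opgs", "pogs", "gpos"]


def usable_prefix(index, bound):
    """Count the leading key parts of `index` that are in `bound`."""
    n = 0
    for col in index:
        if col not in bound:
            break
        n += 1
    return n


def best_index(bound):
    """Pick the index whose leading key parts cover the most bound columns."""
    return max(INDICES, key=lambda ix: usable_prefix(ix, bound))


# Enumerate every non-empty combination of bound columns.
for size in range(1, 5):
    for bound in combinations("spog", size):
        ix = best_index(set(bound))
        print("".join(bound), ix, usable_prefix(ix, set(bound)))
```
</pre>
</blockquote>
<p>Every combination finds an index with at least one bound leading key part, so no lookup starts from a full index scan. How selective that leading part is remains a separate question for the query optimizer.</p>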

<table>
 <tr>
  <td> </td>
  <td align="center"><b> Virtuoso v5 with<br /> gspo, ogps, pogs</b>
  </td>
  <td align="center"><b> Virtuoso Cluster v6 with <br /> spog, pogs, opgs, gpos </b>
  </td>
 </tr>
<tr>
  <td><b>warm</b>
  </td>
  <td align="center">0.595 s</td>
  <td align="center">0.617 s</td>
</tr>
</table>

<p>The cold times were about the same as above, so not
reproduced.</p>
<h3>Graph or No Graph?</h3>
<p>It is in the SPARQL spirit to specify a graph and for pretty
much any application, there are entirely sensible ways of keeping
the data in graphs and specifying which ones are concerned by
queries. This is why Virtuoso is set up for this by default.</p>
<p>On the other hand, for the open web scenario, dealing with an
unknown large number of graphs, enumerating graphs is not possible
and questions like which graph of which source asserts x become
relevant. We have two distinct use cases which warrant different
setups of the database, simple as that.</p>
<p>The latter use case is not really within the SPARQL spec, so
implementations may or may not support this. For example <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1cd2db78">Oracle</a> or
Vertica would not do this well since they partition data according
to graph or predicate, respectively. On the other hand, stores that
work with one quad table, which is most of the ones out there,
should manage it, perhaps with some configuration, as shown above.</p>
<p>Frameworks like Jena are not to my <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1b300390">knowledge</a> geared towards
having a wildcard for graph, although I would suppose this can be
arranged by adding some &quot;super-graph&quot; object, a graph of all
graphs. I don&#39;t think this is directly supported and besides most
apps would not need it.</p>
<p>Once the indices are right, there is no difference between
specifying a graph and not specifying a graph with the queries considered. With
more complex queries, specifying a graph or set of graphs does
allow some optimizations that cannot be done with no graph specified.
For example, bitmap intersections are possible only when all
leading key parts are given.</p>
<h3>Conclusions</h3>
<p>The best warm cache time is with v5; the five queries run in under
600 ms after the first go. This is noted to show that all-in-memory with
a single thread of execution is hard to beat.</p>
<p>Cluster v6 performs the same queries in 623 ms. What is gained in
parallelism is lost in latency if all operations complete in
microseconds. On the other hand, Cluster v6 leaves v5 in the dust in
any situation that has less than 100% hit rate. This is due to
actual benefit from parallelism if operations take longer than a
few microseconds, such as in the case of disk reads. Cluster v6 has
substantially better data layout on disk, as well as fewer pages to
load for the same content.</p>
<p>This makes it possible to run the queries without the pogs
index on Cluster v6 even when v5 takes prohibitively long.</p>
<p>The moral of the story is to have a lot of RAM and space-efficient data representation.</p>
<p>The DBpedia benchmark does not specify any random access
pattern that would give a measure of sustained throughput under
load, so we are left with the extremes of cold and warm cache of
which neither is quite realistic.</p>
<p>Chris Bizer and I have talked on and off about benchmarks and
I have made suggestions that we will see incorporated into the
Berlin SPARQL benchmark, which will, I believe, be much more
informative.</p>
<h3>Appendix: Query Text</h3>
<p>For reference, the query texts specifying the graph are below. To
run without specifying the graph, just drop the <code>FROM
&lt;http://dbpedia.org&gt;</code> from each query. The returned row counts are indicated
below each query&#39;s text.</p>
<blockquote>
 <code><pre>
sparql SELECT ?p ?o FROM &lt;http://dbpedia.org&gt; WHERE {
  &lt;http://dbpedia.org/resource/Metropolitan_Museum_of_Art&gt; ?p ?o };

-- 1337 rows

sparql PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM &lt;http://dbpedia.org&gt; WHERE {
  ?film1 p:starring &lt;http://dbpedia.org/resource/Kevin_Bacon&gt; .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?artist ?artwork ?museum ?director FROM &lt;http://dbpedia.org&gt; 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
SELECT ?s ?homepage FROM &lt;http://dbpedia.org&gt;  WHERE {
   &lt;http://dbpedia.org/resource/Berlin&gt; geo:lat ?berlinLat .
   &lt;http://dbpedia.org/resource/Berlin&gt; geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        &lt;=     ?berlinLat + 0.03190235436 &amp;&amp;
     ?long       &gt;=     ?berlinLong - 0.08679199218 &amp;&amp;
     ?lat        &gt;=     ?berlinLat - 0.03190235436 &amp;&amp; 
     ?long       &lt;=     ?berlinLong + 0.08679199218) };

-- 56 rows

sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?s ?a ?homepage FROM &lt;http://dbpedia.org&gt;  WHERE {
   &lt;http://dbpedia.org/resource/New_York_City&gt; geo:lat ?nyLat .
   &lt;http://dbpedia.org/resource/New_York_City&gt; geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        &lt;=     ?nyLat + 0.3190235436 &amp;&amp;
     ?long       &gt;=     ?nyLong - 0.8679199218 &amp;&amp;
     ?lat        &gt;=     ?nyLat - 0.3190235436 &amp;&amp; 
     ?long       &lt;=     ?nyLong + 0.8679199218) };

-- 13 rows
</pre>
 </code>
</blockquote>
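<p>The FILTER in the last two queries is a plain bounding-box test around a reference point. As a rough sketch of that test outside the database, the Python below mirrors the four comparisons. The function name <code>in_box</code> and the sample coordinates are assumptions made for this illustration; only the two margins come from the Berlin query.</p>
<pre>
```python
def in_box(lat, long, center_lat, center_long, half_height, half_width):
    """True when (lat, long) lies inside the box centred on the reference point."""
    return (
        center_lat - half_height <= lat <= center_lat + half_height
        and center_long - half_width <= long <= center_long + half_width
    )

# Illustrative coordinates only; they are not taken from the DBpedia data set.
BERLIN = (52.52, 13.40)
HALF_HEIGHT = 0.03190235436   # latitude margin used in the Berlin query
HALF_WIDTH = 0.08679199218    # longitude margin used in the Berlin query

nearby = in_box(52.53, 13.45, *BERLIN, HALF_HEIGHT, HALF_WIDTH)
far_away = in_box(48.14, 11.58, *BERLIN, HALF_HEIGHT, HALF_WIDTH)
print(nearby, far_away)
```
</pre>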
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1354">
  <rss:title>SPARQL at WWW 2008</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1354</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1354</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1354</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-04-30T16:28:10Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">SPARQL at WWW 2008 Andy Seaborne and Eric Prud&#39;hommeaux, editors of the SPARQL recommendation, convened a SPARQL birds of a feather session at WWW 2008. The administrative outcome was that implementors could now experiment with extensions, hopefully keeping each other current about their efforts and that towards the end of 2008, a new W3C working group might begin formalizing the experiences into a new SPARQL spec. The session drew a good crowd, including many users and developers. The wishes were largely as expected, with a few new ones added. Many of the wishes already had diverse implementations, however most often without interop. I will below give some comments on the main issues discussed. SPARQL Update - This is likely the most universally agreed upon extension. Implementations exist, largely along the lines of Andy Seaborne&#39;s SPARUL spec, which is also likely material for a W3C member submission. The issue is without much controversy; transactions fall outside the scope, which is reasonable enough. With triple stores, we can define things as combinations of inserts and deletes, and isolation we just leave aside. If anything, operating on a transactional platform such as Virtuoso, one wishes to disable transactions for any operations such as bulk loads and long-running inserts and deletes. Transactionality has pretty much no overhead for a few hundred rows, but for a few hundred million rows the cost of locking and rollback is prohibitive. With Virtuoso, we have a row auto-commit mode which we recommend for use with RDF: It commits by itself now and then, optionally keeping a roll forward log, and is transactional enough not to leave half triples around, i.e., inserted in one index but not another. As far as we are concerned, updating physical triples along the SPARUL lines is pretty much a done deal. The matter of updating relational data mapped to RDF is a whole other kettle of fish. 
On this, I should say that RDF has no special virtues for expressing transactions but rather has a special genius for integration. Updating is best left to web service interfaces that use SQL on the inside. Anyway, updating union views, which most mappings will be, is complicated. Besides, for transactions, one usually knows exactly what one wishes to update. Full Text - Many people expressed a desire for full text access. Here we run into a deplorable confusion with regexps. The closest SPARQL has to full text in its native form is regexps, but these are not really mappable to full text except in rare special cases and I would despair of explaining to an end user what exactly these cases are. So, in principle, some regexps are equivalent to full text but in practice I find it much preferable to keep these entirely separate. It was noted that what the users want is a text box for search words. This is a front end to the CONTAINS predicate of most SQL implementations. Ours is MS SQL Server compatible and has a SPARQL version called bif:contains. One must still declare which triples one wants indexed for full text, though. This admin overhead seems inevitable, as text indexing is a large overhead and not needed by all applications. Also, text hits are not boolean; usually they come with a hit score. Thus, a SPARQL extension for this could look like select * where { ?thing has_description ?d . ?d ftcontains &quot;gizmo&quot; ftand &quot;widget&quot; score ?score . } This would return all the subjects, descriptions, and scores, from subjects with a has_description property containing widget and gizmo. Extending the basic pattern is better than having the match in a filter, since the match binds a variable. The XQuery/XPath groups have recently come up with a full-text spec, so I used their style of syntax above. We already have a full-text extension, as do some others, but for standardization, it is probably most appropriate to take the XQuery work as a basis. 
The XQuery full-text spec is quite complex, but I would expect most uses to get by with a small subset, and the structure seems better thought out, at first glance, than the more ad-hoc implementations in diverse SQLs. Again, declaring any text index to support the search, as well as its timeliness or transactionality, are best left to implementations. Federation - This is a tricky matter. ARQ has a SPARQL extension for sending a nested set of triple patterns to a specific end-point. The DARQ project has something more, including a selectivity model for SPARQL. With federated SQL, life is simpler since after the views are expanded, we have a query where each table is at a known server and has more or less known statistics. Generally, execution plans where as much work as possible is pushed to the remote servers are preferred, and modeling the latencies is not overly hard. With SPARQL, each triple pattern could in principle come from any of the federated servers. Associating a specific end-point to a fragment of the query just passes the problem to the user. It is my guess that this is the best we can do without getting very elaborate, and possibly buggy, end-point content descriptions for routing federated queries. Having said this, there remains the problem of join order. I suggested that we enhance the protocol by allowing asking an end-point for the query cost for a given SPARQL query. Since they all must have a cost model for optimization, this should not be an impossible request. A time cost and estimated cardinality would be enough. Making statistics available à la DARQ was also discussed. Being able to declare cardinalities expected of a remote end-point is probably necessary anyway, since not all will implement the cost model interface. For standardization, agreeing on what is a proper description of content and cardinality and how fine grained this must be will be so difficult that I would not wait for it. 
A cost model interface would nicely hide this within the end-point itself. With Virtuoso, we do not have a federated SPARQL scheme but we could have the ARQ-like service construct. We&#39;d use our own cost model with explicit declarations of cardinalities of the remote data for guessing a join order. Still, this is a bit of work. We&#39;ll see. For practicality, the service construct coupled with join order hints is the best short term bet. Making this pretty enough for standardization is not self-evident, as it requires end-point description and/or cost model hooks for things to stay declarative. End-point description - This question has been around for a while; I have blogged about it earlier, but we are not really at a point where there would be even rough consensus about an end-point ontology. We should probably do something on our own to demonstrate some application of this, as we host lots of linked open data sets. SQL equivalence - There were many requests for aggregation, some for subqueries and nesting, expressions in select, negation, existence and so on. I would call these all SQL equivalence. One use case was taking all the teams in the database and for all with over 5 members, add the big_team class and a property for member count. With Virtuoso, we could write this as -- construct { ?team a big_team . ?team member_count ?ct } from ... where {?team a team . { select ?team2 count (*) as ?ct where { ?m member_of ?team2 } . filter (?team = ?team2 and ?ct &gt; 5) }} We have pretty much all the SQL equivalence features, as we have been working for some time at translating the TPC-H workload into SPARQL. The usefulness of these things is uncontested but standardization could be hard as there are subtle questions about variable scope and the like. Inference - The SPARQL spec does not deal with transitivity or such matters because it is assumed that these are handled by an underlying inference layer. This is however most often not so. 
There was interest in more fine grained control of inference, for example declaring that just one property in a query would be transitive or that subclasses should be taken into account in only one triple pattern. As far as I am concerned, this is very reasonable, and we even offer extensions for this sort of thing in Virtuoso&#39;s SPARQL. This however only makes sense if the inference is done at query time and pattern by pattern. For instance, if forward chaining is used, this no longer makes sense. Specifying that some forward chaining ought to be done at query time is impractical, as the operation can be very large and time consuming and it is the DBA&#39;s task to determine what should be stored and for how long, how changes should be propagated, and so on. All these are application dependent and standardizing will be difficult. Support for RDF features like lists and bags would all fall into the functions an underlying inference layer should perform. These things are of special interest when querying OWL models, for example. Path expressions - Path expressions were requested by a few people. We have implemented some, as in ?product+&gt;has_supplier+&gt;s_name = &quot;Gizmos, Inc.&quot;. This means that one supplier of product has name &quot;Gizmos, Inc.&quot;. This is a nice shorthand but we run into problems if we start supporting repetitive steps, optional steps, and the like. In conclusion, update, full text, and basic counting and grouping would seem straightforward at this point. Nesting queries, value subqueries, views, and the like should not be too hard if an agreement is reached on scope rules. Inference and federation will probably need more experimentation but a lot can be had already with very simple fine grained control of backward chaining, if such applies, or with explicit end-point references and explicit join order. These are practical but not pretty enough for committee consensus, would be my guess. 
Anyway, it will be a few months before anything formal will happen.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">SPARQL at WWW 2008</div>
<p>Andy Seaborne and Eric Prud&#39;hommeaux, editors of the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1501d1a0">SPARQL</a> recommendation, convened a SPARQL birds of a feather session at <a href="http://www2008.org/" id="link-id0xb9d6c10">WWW 2008</a>. The administrative outcome was that implementors could now
experiment with extensions, hopefully keeping each other current about their efforts, and that towards the end of 2008 a new W3C working group might begin formalizing the experiences into a new SPARQL spec.</p>
<p>The session drew a good crowd, including many users and developers. The wishes were largely as expected, with a few new ones added. Many of the wishes already had diverse implementations, though most often without interoperability. Below, I comment on the main issues discussed.</p>
</div>
<li>
<p>
  <b>SPARQL Update</b> - This is likely the most universally agreed upon extension. Implementations exist, largely along the lines of Andy Seaborne&#39;s SPARUL spec, which is also likely material for a W3C member submission. The issue is without much controversy; transactions fall outside the scope, which is reasonable enough. With triple stores, we can define things as combinations of inserts and deletes, and isolation we just leave aside. If anything, operating on a transactional platform such as <a href="http://virtuoso.openlinksw.com" id="link-id0xc13fe98">Virtuoso</a>, one wishes to disable transactions for any operations such as bulk loads and long-running inserts and deletes. Transactionality has pretty much no overhead for a few hundred rows, but for a few hundred million rows the cost of locking and rollback is prohibitive. With Virtuoso, we have a row auto-commit mode which we recommend for use with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xd7bff00">RDF</a>: It commits by itself now and then, optionally keeping a roll forward log, and is transactional enough not to leave half triples around, i.e., inserted in one index but not another.</p>
<p>As far as we are concerned, updating physical triples along the SPARUL lines is pretty much a done deal.</p>
<p>The matter of updating relational <a href="http://dbpedia.org/resource/Data" id="link-id0x140ea538">data</a> mapped to RDF is a whole other kettle of fish. On this, I should say that RDF has no special virtues for expressing transactions but rather has a special genius for integration. Updating is best left to web service interfaces that use <a href="http://dbpedia.org/resource/SQL" id="link-id0xa24e9558">SQL</a> on the inside. Anyway, updating union views, which most mappings will be, is complicated. Besides, for transactions, one usually knows exactly what one wishes to update.</p>
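<p>To make the idea of committing by itself now and then concrete, here is a minimal Python sketch using the standard <code>sqlite3</code> module as a stand-in. It is only an analogy: the table layout, the <code>bulk_insert</code> helper, and the batch size are assumptions for this example and say nothing about how Virtuoso implements row auto-commit.</p>
<pre>
```python
import sqlite3

def bulk_insert(conn, triples, batch_size=1000):
    """Insert triples, committing every batch_size rows instead of once at the end.

    Returns the number of commits issued.
    """
    commits = 0
    pending = 0
    for s, p, o in triples:
        conn.execute("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", (s, p, o))
        pending += 1
        if pending == batch_size:
            conn.commit()      # keeps the open transaction small
            commits += 1
            pending = 0
    if pending:
        conn.commit()          # flush the final partial batch
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
data = [(f"s{i}", "p", f"o{i}") for i in range(2500)]
n_commits = bulk_insert(conn, data, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(n_commits, row_count)
```
</pre>
<p>Each row is still written whole, so a failure loses at most the current batch rather than the entire load.</p>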
</li>
<li>
<p>
  <b>Full Text</b> - Many people expressed a desire for full text access. Here we run into a deplorable confusion with regexps. The closest thing SPARQL has to full text in its native form is regexps, but these are not really mappable to full text except in rare special cases, and I would despair of explaining to an end user exactly what these cases are. So, in principle, some regexps are equivalent to full text, but in practice I find it much preferable to keep the two entirely separate.</p>
<p>It was noted that what the users want is a text box for search words. This is a front end to the CONTAINS predicate of most SQL implementations. Ours is MS SQL Server compatible and has a SPARQL version called <code>bif:contains</code>. One must still declare which triples one wants indexed for full text, though. This admin overhead seems inevitable, as text indexing is a large overhead and not needed by all applications.</p>
<p>Also, text hits are not boolean; usually they come with a hit score. Thus, a SPARQL extension for this could look like </p>
<blockquote>
  <code>select * where { ?thing has_description ?d . ?d ftcontains &quot;gizmo&quot; ftand &quot;widget&quot; score ?score . }</code>
</blockquote>
<p>This would return all the subjects, descriptions, and scores for subjects with a has_description property containing both &quot;widget&quot; and &quot;gizmo&quot;. Extending the basic pattern is better than having the match in a filter, since the match binds a variable.</p>
<p>The <a href="http://dbpedia.org/resource/XQuery" id="link-id0x9ddb7240">XQuery</a>/<a href="http://dbpedia.org/resource/XPath" id="link-id0x9d84e070">XPath</a> groups have recently come up with a full-text spec, so I used their style of syntax above. We already have a full-text extension, as do some others, but for standardization, it is probably most appropriate to take the XQuery work as a basis. The XQuery full-text spec is quite complex, but I would expect most uses to get by with a small subset, and the structure seems better thought out, at first glance, than the more ad-hoc implementations in diverse SQLs.</p>
<p>Again, declaring any text index to support the search, as well as its timeliness or transactionality, is best left to implementations.</p>
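<p>The point that a text match yields a score as well as a yes/no answer can be sketched in a few lines of Python. The function <code>ft_search</code>, the sample descriptions, and the scoring rule (a simple occurrence count) are all assumptions for illustration; real text indexes rank hits in more elaborate ways.</p>
<pre>
```python
def ft_search(descriptions, words):
    """Return (thing, description, score) for descriptions containing every word.

    The score is the total number of occurrences of the search words.
    """
    hits = []
    for thing, text in descriptions.items():
        tokens = text.lower().split()
        counts = [tokens.count(w.lower()) for w in words]
        if all(counts):                      # every word must occur at least once
            hits.append((thing, text, sum(counts)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

descriptions = {
    "thing1": "a gizmo that doubles as a widget",
    "thing2": "a plain widget",
    "thing3": "gizmo gizmo widget",
}
results = ft_search(descriptions, ["gizmo", "widget"])
for thing, text, score in results:
    print(thing, score)
```
</pre>
<p>Because the score is a value produced by the match, it behaves like a bound variable in the result row, which is why a pattern fits better than a filter.</p>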
</li>
<li>
<p>
  <b>Federation</b> - This is a tricky matter. ARQ has a SPARQL extension for sending a nested set of triple patterns to a specific end-point. The DARQ project has something more, including a selectivity model for SPARQL.</p>
<p>With federated SQL, life is simpler since after the views are expanded, we have a query where each table is at a known server and has more or less known statistics. Generally, execution plans where as much work as possible is pushed to the remote servers are preferred, and modeling the latencies is not overly hard. With SPARQL, each triple pattern could in principle come from any of the federated servers. Associating a specific end-point to a fragment of the query just passes the problem to the user. It is my guess that this is the best we can do without getting very elaborate, and possibly buggy, end-point content descriptions for routing federated queries.</p>
<p>Having said this, there remains the problem of join order. I suggested that we enhance the protocol so that a client can ask an end-point for the estimated cost of a given SPARQL query. Since all end-points must have a cost model for optimization, this should not be an impossible request. A time cost and estimated cardinality would be enough. Making statistics available <i>à la</i> DARQ was also discussed. Being able to declare cardinalities expected of a remote end-point is probably necessary anyway, since not all will implement the cost model interface. For standardization, agreeing on what is a proper description of content and cardinality, and how fine grained this must be, will be so difficult that I would not wait for it. A cost model interface would nicely hide this within the end-point itself.</p>
<p>With Virtuoso, we do not have a federated SPARQL scheme but we could have the ARQ-like service construct. We&#39;d use our own cost model with explicit declarations of cardinalities of the remote data for guessing a join order. Still, this is a bit of work. We&#39;ll see.</p>
<p>For practicality, the service construct coupled with join order hints is the best short term bet. Making this pretty enough for standardization is not self-evident, as it requires end-point description and/or cost model hooks for things to stay declarative.</p>
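<p>As a rough sketch of how reported and declared cardinalities could feed a join order guess, the Python below sorts query fragments by estimated result size, preferring an estimate reported by the end-point and falling back to a hand-declared one. The fragment names, the numbers, and the smallest-first heuristic are assumptions for this example; a real optimizer would also weigh join selectivity and latency.</p>
<pre>
```python
def order_fragments(fragments, reported, declared, default=1_000_000):
    """Order query fragments by ascending estimated cardinality.

    reported: estimates returned by end-points that answer cost requests.
    declared: cardinalities declared by hand for end-points that do not.
    default:  pessimistic guess when nothing is known about a fragment.
    """
    def estimate(fragment):
        if fragment in reported:
            return reported[fragment]
        return declared.get(fragment, default)

    return sorted(fragments, key=estimate)

# Hypothetical fragments, each bound to some remote end-point.
fragments = ["all_films", "actor_homepage", "films_by_actor"]
reported = {"films_by_actor": 40, "all_films": 60000}
declared = {"actor_homepage": 5000}

plan = order_fragments(fragments, reported, declared)
print(plan)
```
</pre>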
</li>
<li>
<p>
  <b>End-point description</b> - This question has been around for a while; I have <a href="http://www.openlinksw.com/weblog/oerling/?id=1085" id="link-id10fa7da8">blogged about it earlier</a>, but we are not really at a point where there would be even rough consensus about an end-point ontology. We should probably do something on our own to demonstrate some application of this, as we host lots of <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0xd048c68">linked open data</a> sets.</p>
</li>
<li>
<p>
  <b>SQL equivalence</b> - There were many requests for aggregation, some for subqueries and nesting, expressions in select, negation, existence and so on. I would call these all SQL equivalence. One use case was taking all the teams in the database and, for each team with over 5 members, adding the big_team class and a property for member count.</p>
<p>With Virtuoso, we could write this as:</p>
<blockquote>
  <code>construct { ?team a big_team . ?team member_count ?ct } from ... where {?team a team . { select ?team2 count (*) as ?ct where { ?m member_of ?team2 } . filter (?team = ?team2 and ?ct &gt; 5) }}</code>
</blockquote>
<p>We have pretty much all the SQL equivalence features, as we have been working for some time at translating the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0xb9d5200">TPC-H</a> workload into SPARQL.</p>
<p>The usefulness of these things is uncontested but standardization could be hard as there are subtle questions about variable scope and the like.</p>
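<p>For readers who think more easily in procedural terms, the same team use case can be sketched in Python over an in-memory list of triples. The tuple representation and the helper name <code>big_team_triples</code> are assumptions for this illustration, not anything Virtuoso provides.</p>
<pre>
```python
from collections import Counter

def big_team_triples(triples, threshold=5):
    """Construct big_team and member_count triples for teams above the threshold."""
    teams = {s for s, p, o in triples if p == "a" and o == "team"}
    counts = Counter(o for s, p, o in triples if p == "member_of")
    out = []
    for team in sorted(teams):
        ct = counts[team]
        if ct > threshold:
            out.append((team, "a", "big_team"))
            out.append((team, "member_count", ct))
    return out

triples = [("t1", "a", "team"), ("t2", "a", "team")]
triples += [(f"m{i}", "member_of", "t1") for i in range(6)]   # six members
triples += [(f"n{i}", "member_of", "t2") for i in range(3)]   # three members

result = big_team_triples(triples)
print(result)
```
</pre>
<p>The counting step corresponds to the nested select with count, and the threshold test to the filter on the count.</p>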
</li>
<li>
<p>
  <b>Inference</b> - The SPARQL spec does not deal with transitivity or such matters because it is assumed that these are handled by an underlying inference layer. This is, however, most often not so. There was interest in more fine-grained control of inference, for example declaring that just one property in a query would be transitive or that subclasses should be taken into account in only one triple pattern. As far as I am concerned, this is very reasonable, and we even offer extensions for this sort of thing in Virtuoso&#39;s SPARQL. This, however, only makes sense if the inference is done at query time and pattern by pattern. If forward chaining is used, the entailed triples are already stored and per-pattern control no longer makes sense. Specifying that some forward chaining ought to be done at query time is impractical, as the operation can be very large and time consuming, and it is the DBA&#39;s task to determine what should be stored and for how long, how changes should be propagated, and so on. All these are application dependent and standardizing will be difficult.</p>
<p>Support for RDF features like lists and bags would all fall into the functions an underlying inference layer should perform. These things are of special interest when querying OWL models, for example.</p>
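<p>Treating a single property as transitive at query time, without storing any entailed triples, can be sketched as a graph traversal. The Python below is a minimal illustration under assumed names (<code>reachable</code>, the sample class hierarchy); it does not reflect how any particular store evaluates such patterns.</p>
<pre>
```python
def reachable(triples, start, prop):
    """Follow prop transitively from start, computing the closure on demand."""
    edges = {}
    for s, p, o in triples:
        if p == prop:                 # only the chosen property is transitive
            edges.setdefault(s, set()).add(o)
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:       # guards against cycles in the data
                seen.add(nxt)
                stack.append(nxt)
    return seen

triples = [
    ("poodle", "subClassOf", "dog"),
    ("dog", "subClassOf", "mammal"),
    ("mammal", "subClassOf", "animal"),
    ("poodle", "likes", "bone"),
]
supers = reachable(triples, "poodle", "subClassOf")
print(sorted(supers))
```
</pre>
<p>Nothing is materialized here: the closure is recomputed per query, which is the trade-off against forward chaining discussed above.</p>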
</li>
<li>
<p>
  <b>Path expressions</b> - Path expressions were requested by a few people. We have implemented some, as in </p>
 <blockquote>
  <code>?product+&gt;has_supplier+&gt;s_name = &quot;Gizmos, Inc.&quot;.</code>
 </blockquote> This means that one supplier of product has name &quot;Gizmos, Inc.&quot;. This is a nice shorthand but we run into problems if we start supporting repetitive steps, optional steps, and the like.</li>
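<p>The meaning of such a fixed-length path can be sketched in Python as a chain of lookups, one per step. The helper <code>follow_path</code> and the sample triples are assumptions made for this example.</p>
<pre>
```python
def follow_path(triples, start, path):
    """Return all values reached from start by following the properties in path."""
    current = {start}
    for prop in path:
        current = {o for s, p, o in triples if p == prop and s in current}
    return current

triples = [
    ("product1", "has_supplier", "supplier1"),
    ("product1", "has_supplier", "supplier2"),
    ("supplier1", "s_name", "Gizmos, Inc."),
    ("supplier2", "s_name", "Widgets Ltd."),
]
names = follow_path(triples, "product1", ["has_supplier", "s_name"])
matches = "Gizmos, Inc." in names   # true if at least one supplier has that name
print(sorted(names), matches)
```
</pre>
<p>Each step is a simple join. Repetitive or optional steps would turn this loop into a search of unknown depth, which is where the problems start.</p>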
<p>In conclusion, update, full text, and basic counting and grouping would seem straightforward at this point. Nesting queries, value subqueries, views, and the like should not be too hard if an agreement is reached on scope rules. Inference and federation will probably need more experimentation, but a lot can be had already with very simple fine-grained control of backward chaining, if such applies, or with explicit end-point references and explicit join order. These are practical but not pretty enough for committee consensus, would be my guess. Anyway, it will be a few months before anything formal happens.</p>
]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?id=1350">
  <rss:title>Linked Data and Information Architecture</rss:title>
  <rss:link>http://www.openlinksw.com/blog/vdb/blog/?id=1350</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/mt-tb/Http/comments?id=1350</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://www.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1350</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-04-29T14:37:22Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Linked Data and Information Architecture We had a workshop on Linked Open Data (LOD) last week in Beijing. You can see the papers in the program. The event was a success with plenty of good talks and animated conversation. I will not go into every paper here but will comment a little on the conversation and draw some technology requirements going forward. Tim Berners-Lee showed a read-write version of Tabulator. This raises the question of updating on the Data Web. The consensus was that one could assert what one wanted in one&#39;s own space but that others&#39; spaces would be read-only. What spaces one considered relevant would be the user&#39;s or developer&#39;s business, as in the document web. It seems to me that a significant use case of LOD is an open-web situation where the user picks a broad read-only &quot;data wallpaper&quot; or backdrop of assertions, and then uses this combined with a much smaller, local, writable data set. This is certainly the case when editing data for publishing, as in Tim&#39;s demo. This will also be the case when developing mash-ups combining multiple distinct data sets bound together by sets of SameAs assertions, for example. Questions like, &quot;What is the minimum subset of n data sets needed for deriving the result?&quot; will be common. This will also be the case in applications using proprietary data combined with open data. This means that databases will have to deal with queries that specify large lists of included graphs, all graphs in the store or all graphs with an exclusion list. All this is quite possible but again should be considered when architecting systems for an open linked data web. &quot;There is data but what can we really do with it? How far can we trust it, and what can we confidently decide based on it?&quot; As an answer to this question, Zitgist has compiled the UMBEL taxonomy using SKOS. 
This draws on Wikipedia, Open CYC, Wordnet, and YAGO, hence the acronym WOWY. UMBEL is both a taxonomy and a set of instance data, containing a large set of named entities, including persons, organizations, geopolitical entities, and so forth. By extracting references to this set of named entities from documents and correlating this to the taxonomy, one gets a good idea of what a document (or part thereof) is about. Kingsley presented this in the Zitgist demo. This is our answer to the criticism about DBpedia having errors in classification. DBpedia, as a bootstrap stage, is about giving names to all things. Subsequent efforts like UMBEL are about refining the relationships. &quot;Should there be a global URI dictionary?&quot; There was a talk by Paolo Bouquet about Entity Name System, a sort of data DNS, with the purpose of associating some description and rough classification to URIs. This would allow discovering URIs for reuse. I&#39;d say that this is good if it can cut down on the SameAs proliferation and if this can be widely distributed and replicated for resilience, à la DNS. On the other hand, it was pointed out that this was not quite in the LOD spirit, where parties would mint their own dereferenceable URIs, in their own domains. We&#39;ll see. &quot;What to do when identity expires?&quot; Giovanni of Sindice said that a document should be removed from search if it was no longer available. Kingsley pointed out that resilience of reference requires some way to recover data. The data web cannot be less resilient than the document web, and there is a point to having access to history. He recommended hooking up with the Internet Archive, since they make long term persistence their business. In this way, if an application depends on data, and the URIs on which it depends are no longer dereferenceable or provide content from a new owner of the domain, those who need the old version can still get it and host it themselves. 
It is increasingly clear that OWL SameAs is both the blessing and bane of linked data. We can easily have tens of URIs for the same thing, especially with people. Still, these should be considered the same. Returning every synonym in a query answer hardly makes sense but accepting them as input seems almost necessary. This is what we do with Virtuoso&#39;s SameAs support. Even so, this can easily double query times even when there are no synonyms. Be that as it may, SameAs is here to stay; just consider the mapping of DBpedia to Geonames, for example. Also, making aberrant SameAs statements can completely poison a data set and lead to absurd query results. Hence choosing which SameAs assertions from which source will be considered seems necessary. In an open web scenario, this leads inevitably to multi-graph queries that can be complex to write with regular SPARQL. By extension, it seems that a good query would also include the graphs actually used for deriving each result row. This is of course possible but has some implications on how databases should be organized. Yves Raymond gave a talk about deriving identity between Musicbrainz and Jamendo. I see the issue as a core question of linked data in general. The algorithm Yves presented started with attribute value similarities and then followed related entities. Artists would be the same if they had similar names and similar names of albums with similar song titles, for example. We can find the same basic question in any analysis, for example, looking at how news reporting differs between media, supposing there is adequate entity extraction. There is basic graph diffing in RDFSync, for example. But here we are expanding the context significantly. We will traverse references to some depth, allow similarity matches, SameAs, and so forth. Having presumed identity of two URIs, we can then look at the difference in their environment to produce a human readable summary. 
This could then be evaluated for purposes of analysis or of combining content. At first sight, these algorithms seem well parallelizable, as long as all threads have access to all data. For scaling, this means a probably message-bound distributed algorithm. This is something to look into for the next stage of linked data. Some inference is needed, but if everybody has their own choice of data sets to query, then everybody would also have their own entailed triples. This will make for an explosion of entailed graphs if forward chaining is used. Forward chaining is very nice because it keeps queries simple and easy to optimize. With Virtuoso, we still favor backward chaining since we expect a great diversity of graph combinations and near infinite volume in the open web scenario. With private repositories of slowly changing data put together for a special application, the situation is different. In conclusion, we have a real LOD movement with actual momentum and a good idea of what to do next. The next step is promoting this to the broader community, starting with Linked Data Planet in New York in June.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Linked Data and Information Architecture</div>
<p>We had a workshop on <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1437ac70">Linked Open Data</a> (<a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1315f788">LOD</a>) last week in <a href="http://www2008.org/" id="link-id0x13737468">Beijing</a>. You can see the papers in <a href="http://events.linkeddata.org/ldow2008/#program" id="link-id10651ab8">the program</a>. The event was a success with plenty of good talks and animated conversation. I will not go into every paper here but will comment a little on the conversation and draw some technology requirements going forward.</p>
<p>Tim Berners-Lee showed a read-write version of <a href="http://dig.csail.mit.edu/2005/ajar/release/tabulator/0.8/tab.html" id="link-id0x15633520">Tabulator</a>. This raises the question of updating on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1350a178">Data</a> Web. The consensus was that one could assert what one wanted in one&#39;s own space but that others&#39; spaces would be read-only. What spaces one considered relevant would be the user&#39;s or developer&#39;s business, as in the document web.</p>
<p>It seems to me that a significant use case of LOD is an open-web situation where the user picks a broad read-only &quot;data wallpaper&quot; or backdrop of assertions, and then uses this combined with a much smaller, local, writable data set. This is certainly the case when editing data for publishing, as in Tim&#39;s demo. This will also be the case when developing mash-ups combining multiple distinct data sets bound together by sets of SameAs assertions, for example. Questions like, &quot;What is the minimum subset of n data sets needed for deriving the result?&quot; will be common. This will also be the case in applications using proprietary data combined with open data.</p>
<p>This means that databases will have to deal with queries that specify large lists of included graphs, all graphs in the store or all graphs with an exclusion list. All this is quite possible but again should be considered when architecting systems for an open <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0xa27bae8">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x155c3f18">web</a>.</p>
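<p>The three ways of scoping a query named above (an explicit list of graphs, all graphs, or all graphs minus an exclusion list) can be sketched as one small resolution step. The Python below is only an illustration; the function <code>select_graphs</code> and the graph names are assumptions, not a description of any store&#39;s interface.</p>
<pre>
```python
def select_graphs(all_graphs, include=None, exclude=()):
    """Resolve the set of graphs a query should read.

    include=None means every graph in the store; names in include that are
    not in the store are ignored. The exclusion list is applied last.
    """
    if include is None:
        chosen = set(all_graphs)
    else:
        chosen = set(all_graphs).intersection(include)
    return chosen - set(exclude)

# Hypothetical store contents.
store = {"dbpedia", "geonames", "local_edits", "untrusted"}

everything_but = select_graphs(store, exclude=["untrusted"])
explicit = select_graphs(store, include=["dbpedia", "local_edits", "not_loaded"])
print(sorted(everything_but), sorted(explicit))
```
</pre>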
<p>&quot;There is data but what can we really do with it? How far can we trust it, and what can we confidently decide based on it?&quot;</p>
<p>As an answer to this question, <a href="http://zitgist.com/about/" id="link-id0xd447580">Zitgist</a> has compiled the <a href="ht