<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/">
  <rss:title>Orri Erling&#39;s Weblog</rss:title>
  <rss:link>http://www.openlinksw.com/weblog/oerling/</rss:link>
  <rss:description />
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">oerling@openlinksw.com</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2013-05-18T16:19:01Z</dc:date>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-09-01#1572" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1551" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-04-01#1540" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-03-25#1537" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-03-24#1535" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-11#1494" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-27#1487" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-24#1459" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-09-30#1445" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1376" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-05-09#1358" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-11-08#1269" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-08-10#1024" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-09-01#1572">
  <rss:title>Provenance and Reification in Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T14:44:08Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front. RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road. For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level. If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph. On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query. One could write queries like: SELECT ?pub WHERE { GRAPH ?g { ?person foaf:knows ?contact } ?contact foaf:name &quot;Alice&quot; . ?g xx:has_publisher ?pub } This would return the publishers of graphs that assert that somebody knows Alice. Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. 
Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it. If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity. Virtuoso&#39;s RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O. Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed. In SPARQL, one could use an extension syntax like: SELECT * WHERE { ?person foaf:knows ?connection OPTION ( time ?ts ) . ?connection foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2009-08-08T00:00:00&quot;^^xsd:dateTime ) } This would return everybody recorded as knowing Alice in a quad inserted after 2009-08-08. This presupposes that the quad table has been extended with a datetime column. The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. 
In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause. If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like: SELECT * WHERE { ?person foaf:knows ?contact OPTION ( reify ?r ) . ?r xx:assertion_time ?ts . ?contact foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2008-08-08T00:00:00&quot;^^xsd:dateTime ) } We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don&#39;t think there would be huge demand. Who knows? You tell us.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>These days, <a href="http://dbpedia.org/resource/Data" id="link-id0x4a44870">data</a> provenance is a big topic across the board, ranging from the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x4e10e60">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x4738350">web</a>, to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1fe33310">RDF</a> in general, to any kind of data integration, with or without RDF.  Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc.  Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata.  And if they do, the approach is often a proprietary relational schema with web services in front.</p>

<p>RDF and linked data principles could evidently be a great help.  This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.</p>

<p>For now, I will talk about possible ways of dealing with provenance annotations in <a href="http://virtuoso.openlinksw.com" id="link-id0x36581e8">Virtuoso</a> at a fairly technical level.</p>

<p>If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph.  Annotations can then be made on the graph.  The graph IRI will simply occur as the subject of a triple in the same or some other graph.  For example, all such annotations could go into a special annotations graph.</p>
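
<p>As a minimal sketch of such an annotation, assuming standard SPARQL 1.1 Update syntax, a hypothetical <code>http://example.org/</code> namespace, and a placeholder <code>xx:</code> vocabulary that is not a real ontology, the publisher of a source graph could be recorded in a separate annotations graph roughly like this:</p>

<blockquote>
 <code><pre># All IRIs and the xx: vocabulary are placeholders for illustration.
PREFIX  xx:  &lt;http://example.org/xx#&gt;

INSERT DATA
  {
    GRAPH  &lt;http://example.org/annotations&gt;
      {
        # The graph IRI of the source data is the subject of the annotation.
        &lt;http://example.org/graphs/library-catalogue&gt;
              xx:has_publisher  &lt;http://example.org/publishers/library&gt;
      }
  }</pre>
 </code>
</blockquote>

<p>Keeping the annotations in their own graph means the source graph can be dropped and reloaded without touching its metadata.</p>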

<p>On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme <a href="http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfindexes" id="link-id142a0798">discussed in the Virtuoso documentation</a>.  If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.</p>


<p>One could write queries like:</p>

<blockquote>
 <code><pre>SELECT  ?pub 
  WHERE 
    { 
      GRAPH  ?g 
        { 
          ?person  foaf:knows  ?contact 
        } 
      ?contact  foaf:name         &quot;Alice&quot;  . 
      ?g        xx:has_publisher  ?pub 
    }</pre>
 </code>
</blockquote>

<p>This would return the publishers of graphs that assert that somebody knows Alice.</p>
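
<p>The pattern also works in the other direction. As a sketch only, with a placeholder <code>xx:</code> namespace and a hypothetical publisher IRI, one could list who is said to know Alice, considering only graphs annotated with a given publisher:</p>

<blockquote>
 <code><pre># The xx: namespace and the publisher IRI are placeholders for illustration.
PREFIX  foaf:  &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX  xx:    &lt;http://example.org/xx#&gt;

SELECT  ?person 
  WHERE 
    { 
      GRAPH  ?g 
        { 
          ?person  foaf:knows  ?contact 
        } 
      ?contact  foaf:name         &quot;Alice&quot;  . 
      ?g        xx:has_publisher  &lt;http://example.org/publishers/library&gt; 
    }</pre>
 </code>
</blockquote>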

<p>Of course, the <a href="http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification" id="link-id14fa9488">RDF reification vocabulary</a> can be used as-is to say things about single triples.  It is however very inefficient and is not supported by any specific optimization.  Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.</p>

<p>If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.</p>
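
<p>To illustrate why the vocabulary is unwieldy, here is a sketch of the query one would need if assertion times were attached to reified statements.  The <code>rdf:</code> terms are the standard reification vocabulary; <code>xx:assertion_time</code> and its namespace are placeholders, not part of any real ontology:</p>

<blockquote>
 <code><pre>PREFIX  rdf:   &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX  foaf:  &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX  xx:    &lt;http://example.org/xx#&gt;

SELECT  ?person  ?ts 
  WHERE 
    { 
      # Four triples describe the statement before anything is said about it.
      ?stmt     rdf:type           rdf:Statement  ; 
                rdf:subject        ?person        ; 
                rdf:predicate      foaf:knows     ; 
                rdf:object         ?contact       ; 
                xx:assertion_time  ?ts            . 
      ?contact  foaf:name          &quot;Alice&quot; 
    }</pre>
 </code>
</blockquote>

<p>Each annotated triple thus costs four extra triples for the reification itself, plus one per annotation.</p>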

<p>Virtuoso&#39;s <code>RDF_QUAD</code> table can be altered to have more columns.  The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns.  A <a href="http://dbpedia.org/resource/SQL" id="link-id0x4b1d938">SQL</a> update statement can be used to set values for these additional columns if one knows the <code>G,S,P,O</code>. </p>

<p>Suppose we annotated each quad with the user who inserted it and a timestamp.  These would be columns in the <code>RDF_QUAD</code> table.  The next choice would be whether these were primary key parts or dependent parts.  If primary key parts, these would be non-<code>NULL</code> and would occur on every index.  The same quad would exist for each distinct user and time this quad had been inserted.  For loading functions to work, these columns would need a default.  In practice, we think that having such metadata as a dependent part is more likely, so that <code>G,S,P,O</code> are the unique identifier of the quad.  Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.</p>

<p>In <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x472afb0">SPARQL</a>, one could use an extension syntax like:</p>

<blockquote>
 <code><pre>SELECT  * 
  WHERE 
    { ?person      foaf:knows  ?connection 
                   OPTION ( time  ?ts )     . 
      ?connection  foaf:name   &quot;Alice&quot;      . 
      FILTER ( ?ts &gt; &quot;2009-08-08T00:00:00&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>This would return everybody recorded as knowing Alice in a quad inserted after 2009-08-08.  This presupposes that the quad table has been extended with a datetime column.</p>

<p>The <code>OPTION (time ?ts)</code> syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of <code>RDF_QUAD</code> via a column <code>?variable</code> syntax in the <code>OPTION</code> clause.</p>


<p>If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of <code>GSPO</code> and a dependent part of <code>R</code>, where <code>R</code> would be the reification <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x365b190">URI</a> of the quad.  Reification statements would then be made with <code>R</code> as a subject.  This would be more compact than the reification vocabulary and would not modify the <code>RDF_QUAD</code> table.  The syntax for referring to this could be something like:</p>

<blockquote>
 <code><pre>SELECT * 
  WHERE 
    { ?person   foaf:knows         ?contact 
                OPTION ( reify  ?r )          . 
      ?r        xx:assertion_time  ?ts       . 
      ?contact  foaf:name          &quot;Alice&quot;   . 
      FILTER ( ?ts &gt; &quot;2008-08-08T00:00:00&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary.  But since it is so unwieldy I don&#39;t think there would be huge demand.  Who knows?  You tell us.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1551">
  <rss:title>Web Science and Keynotes at WWW 2009 (#4 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-30T16:00:22Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Fourth of five posts related to the WWW 2009 conference, held the week of April 20, 2009.) There was quite a bit of talk about what web science could or ought to be. I will here comment a bit on the panels and keynotes, in no special order. In the web science panel, Tim Berners-Lee said that the deliverable of the web science initiative could be a way of making sense of all the world&#39;s data once the web had transformed into a database capable of answering arbitrary queries. Michael Brodie of Verizon said that one deliverable would be a well considered understanding of the issue of counter-terrorism and civil liberties: Everything, including terrorism, operates on the platform of the web. How do we understand an issue that is not one of privacy, intelligence, jurisprudence, or sociology, but of all these and more? I would add to this that it is not only a matter of governments keeping and analyzing vast amounts of private data, but of basically anybody who wants to do this being able to do so, even if at a smaller scale. In a way, the data web brings formerly government-only capabilities to the public, and is thus a democratization of intelligence and analytics. The citizen blogger increased the accountability of the press; the citizen analyst may have a similar effect. This is trickier though. We remember Jefferson&#39;s words about vigilance and the price of freedom. But vigilance is harder today, not because information is not there but because there is so much of it, with diverse spins put on it. Tim B-L said at another panel that it seemed as if the new capabilities, especially the web as a database, were coming just in time to help us cope with the problems confronting the planet. With this, plus having everybody online, we would have more information, more creativity, more of everything at our disposal. 
I&#39;d have to say that the web is dual use: The bulk of traffic may contribute to distraction more than to awareness, but then the same infrastructure and the social behaviors it supports may also create unprecedented value and in the best of cases also transparency. I have to think of &quot;For whosoever hath, to him shall be given.&quot; [Matthew 13:12] This can mean many things; here I am talking about whoever hath a drive for knowledge. The web is both equalizing and polarizing: The equality is in the access; the polarity in the use made thereof. For a huge amount of noise there will be some crystallization of value that could not have arisen otherwise. Developments have unexpected effects. I would not have anticipated that gaming should advance supercomputing, for example. Wendy Hall gave a dinner speech about communities and conferences; how the original hypertext conferences, with lots of representation of the humanities, became the techie WWW conference series; and how now we have the pendulum swinging back to more diversity with the web science conferences. So it is with life. Aside from the facts that there are trends and pendulum effects, and that paths that cross usually cross again, it is very hard to say exactly how these things play out. At the &quot;20 years of web&quot; panel, there was a round of questions on how different people had been surprised by the web. Surprises ranged from the web&#39;s actual scalability to its rapid adoption and the culture of &quot;if I do my part, others will do theirs.&quot; On the minus side, the emergence of spam and phishing were mentioned as unexpected developments. Questions of simplicity and complexity got a lot of attention, along with network effects. When things hit the right simplicity at the right place (e.g., HTML and HTTP, which hypertext-wise were nothing special), there is a tipping point. 
The call for no barrier to entry and not too much modeling was repeated quite a bit, also in relation to semantic web and ontology design. There is a magic of emergent effects when the pieces are simple enough: Organic chemistry out of a couple of dozen elements; all the world&#39;s information online with a few tags of markup and a couple of protocol verbs. But then this is where the real complexity starts: one half of it in the transport, the other in the applications, yet a narrow interface between the two. This then raises the question of content- and application-aware networks. The preponderance of opinion was for separation of powers: keep carriers and content apart. Michael Brodie commented in the questions to the first panel that simplicity was greatly overrated, that the world was in fact very complex. It seems to me that any field of human endeavor develops enough complexity to fully occupy the cleverest minds who undertake said activity. The life-cycle between simplicity and complexity seems to be a universal feature. It is a bit like the Zen idea that &quot;for the beginner, rivers are rivers and mountains are mountains, for the student these are imponderable mysteries of bewildering complexity and transcendent dimension but for the master these are again rivers and mountains.&quot; One way of seeing this is that the master, in spite of the actual complexity and interrelatedness of all things, sees where these complexities are significant and where not and knows to communicate concerning these as fits the situation. There is no fixed formula for saying where complexities and simplicities fit; relevance of detail is forever contextual. For technological systems, we find that there emerge relatively simple interfaces on either side of which there is huge complexity: The x86 instruction set, TCP/IP, SQL, to name a few. These are lucky breaks; it is very hard to say beforehand where these will emerge. 
Object-oriented people would like to see such interfaces everywhere, which just leads to problems of modeling. There was a keynote from Telefonica about infrastructure. We heard that the power and cooling cost more than the equipment, that data centers ought to be scaled down from the football stadium and 20 megawatt scale, that systems must be designed for partitioning, to name a few topics. This is all well accepted. The new question is whether storage should go into the network infrastructure. We have blogged that the network will be the database, and it is no surprise that a telco should have the same idea, just with slightly different emphasis and wording. For Telefonica, this is about efficiency of bulk delivery; for us this is more about virtualized queryable dataspaces. Both will be distributed but issues of separation of powers may keep the two roles of network with storage separate. In conclusion, the network being the database was much more visible and accepted this year than last. The linked data web was in Tim B-L&#39;s keynote as it was in the opening speech by the Prince of Asturias.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(Fourth of five posts related to the <a href="http://www2009.org/" id="link-id0x1232b550">WWW 2009</a> conference, held the week of April 20, 2009.)</p>
<p>There was quite a bit of talk about what web science could or ought to be. I will here comment a bit on the <a href="http://www2009.org/panels.html" id="link-id1514ec30">panels</a> and <a href="http://www2009.org/keynote_abs.html" id="link-id11a5d620">keynotes</a>, in no special order. </p>

<p>In the web science panel, Tim Berners-Lee said that the deliverable of the web science initiative could be a way of making sense of all the world&#39;s <a href="http://dbpedia.org/resource/Data" id="link-id0xe01cd68">data</a> once the web had transformed into a database capable of answering arbitrary queries.</p>

<p>Michael Brodie of Verizon said that one deliverable would be a well considered understanding of the issue of counter-terrorism and civil liberties: Everything, including terrorism, operates on the platform of the web. How do we understand an issue that is not one of privacy, intelligence, jurisprudence, or sociology, but of all these and more?</p>

<p>I would add to this that it is not only a matter of governments keeping and analyzing vast amounts of private data, but of basically anybody who wants to do this being able to do so, even if at a smaller scale. In a way, the data web brings formerly government-only capabilities to the public, and is thus a democratization of intelligence and analytics. The citizen blogger increased the accountability of the press; the citizen analyst may have a similar effect. This is trickier though. We remember Jefferson&#39;s words about vigilance and the price of freedom. But vigilance is harder today, not because <a href="http://dbpedia.org/resource/Information" id="link-id0x130558b8">information</a> is not there but because there is so much of it, with diverse spins put on it.</p>

<p>Tim B-L said at another panel that it seemed as if the new capabilities, especially the web as a database, were coming just in time to help us cope with the problems confronting the planet. With this, plus having everybody online, we would have more information, more creativity, more of everything at our disposal.</p>

<p>I&#39;d have to say that the web is dual use: The bulk of traffic may contribute to distraction more than to awareness, but then the same infrastructure and the social behaviors it supports may also create unprecedented value and in the best of cases also transparency. I have to think of &quot;For whosoever hath, to him shall be given.&quot; [Matthew 13:12] This can mean many things; here I am talking about whoever hath a drive for <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x16032470">knowledge</a>.</p>

<p>The web is both equalizing and polarizing: The equality is in the access; the polarity in the use made thereof. For a huge amount of noise there will be some crystallization of value that could not have arisen otherwise. Developments have unexpected effects. I would not have anticipated that gaming should advance supercomputing, for example.</p>

<p>Wendy Hall gave a dinner speech about communities and conferences; how the original hypertext conferences, with lots of representation of the humanities, became the techie WWW conference series; and how now we have the pendulum swinging back to more diversity with the web science conferences. So it is with life. Aside from the facts that there are trends and pendulum effects, and that paths that cross usually cross again, it is very hard to say exactly how these things play out.</p>

<p>At the &quot;20 years of web&quot; panel, there was a round of questions on how different people had been surprised by the web. Surprises ranged from the web&#39;s actual scalability to its rapid adoption and the culture of &quot;if I do my part, others will do theirs.&quot; On the minus side, the emergence of spam and phishing were mentioned as unexpected developments.</p>

<p>Questions of simplicity and complexity got a lot of attention, along with network effects. When things hit the right simplicity at the right place (e.g., HTML and <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1069cc18">HTTP</a>, which hypertext-wise were nothing special), there is a tipping point.</p>

<p>The call for no barrier to entry and not too much modeling was repeated quite a bit, also in relation to <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x15d2c200">semantic web</a> and ontology design. There is a magic of emergent effects when the pieces are simple enough: Organic chemistry out of a couple of dozen elements; all the world&#39;s information online with a few tags of markup and a couple of protocol verbs. But then this is where the real complexity starts: one half of it in the transport, the other in the applications, yet a narrow interface between the two.</p>

<p>This then raises the question of content- and application-aware networks. The preponderance of opinion was for separation of powers: keep carriers and content apart.</p>

<p>Michael Brodie commented in the questions to the first panel that simplicity was greatly overrated, that the world was in fact very complex. It seems to me that any field of human endeavor develops enough complexity to fully occupy the cleverest minds who undertake said activity. The life-cycle between simplicity and complexity seems to be a universal feature. It is a bit like the Zen idea that &quot;for the beginner, rivers are rivers and mountains are mountains, for the student these are imponderable mysteries of bewildering complexity and transcendent dimension but for the master these are again rivers and mountains.&quot; One way of seeing this is that the master, in spite of the actual complexity and interrelatedness of all things, sees where these complexities are significant and where not and knows to communicate concerning these as fits the situation.</p>

<p>There is no fixed formula for saying where complexities and simplicities fit; relevance of detail is forever contextual. For technological systems, we find that there emerge relatively simple interfaces on either side of which there is huge complexity: The x86 instruction set, TCP/IP, <a href="http://dbpedia.org/resource/SQL" id="link-id0x10363000">SQL</a>, to name a few. These are lucky breaks; it is very hard to say beforehand where these will emerge. Object-oriented people would like to see such interfaces everywhere, which just leads to problems of modeling.</p>

<p>There was a keynote from Telefonica about infrastructure. We heard that the power and cooling cost more than the equipment, that data centers ought to be scaled down from the football stadium and 20 megawatt scale, that systems must be designed for partitioning, to name a few topics. This is all well accepted. The new question is whether storage should go into the network infrastructure. We have blogged that the network will be the database, and it is no surprise that a telco should have the same idea, just with slightly different emphasis and wording. For Telefonica, this is about efficiency of bulk delivery; for us this is more about virtualized queryable dataspaces. Both will be distributed but issues of separation of powers may keep the two roles of network with storage separate.</p>

<p>In conclusion, the network being the database was much more visible and accepted this year than last. The <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x100f4cf0">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x15a55db8">web</a> was in Tim B-L&#39;s keynote as it was in the opening speech by the Prince of Asturias.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-04-01#1540">
  <rss:title>Web Scale and Fault Tolerance</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-01T15:18:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related. It has been said many times: when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers. Approaches to Fault Tolerance Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of write speed. With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks. There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough. 
In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous. This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application&#39;s requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context. Basics of Partition Fail-Over For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site. The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application. Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application. 
When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, handling in practice goes through two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled, but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier. For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the surviving copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the surviving copy but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later. If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved. 
As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action. Levels of Tolerance and Performance The only way to make failures totally invisible is to have all in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. This is why we can do better, using the spare capacity for more than standby. Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources. The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage. Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory. Background Bulk Processing When loading data, the system is online in principle, but query response can be quite bad. 
A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server&#39;s partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference. But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate. Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost. This technique applies to all data intensive background tasks: calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating. Configurations of Redundancy Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. 
Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the two first failures are guaranteed not to interrupt operations; for the third, there is a 1/10 chance that it will. We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers&#39; RAM when in normal operation. There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon&#39;s Dynamo does something similar at the database level. The analogies are not exact, though. If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, there is a probability that the partitions formerly hosted by the failed server have their second copies randomly spread over the remaining servers. This scheme equalizes load better but is less resilient. Maintenance and Continuity Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime. Present Status The basics of replicated partitions are operational. 
The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>One concern about <a href="http://virtuoso.openlinksw.com" id="link-id0x3b82c38">Virtuoso</a> Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x22c42e10">Linked Data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x1e4f0b58">Web</a>. We will see how large scale, continuous operation, and redundancy are related.</p>

<p>It has been said many times: when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema <a href="http://dbpedia.org/resource/Information" id="link-id0x224401c0">information</a> in synchronous replicas on all servers.</p>

<h2>Approaches to Fault Tolerance</h2>

<p>Fault tolerance has many aspects but it starts with keeping <a href="http://dbpedia.org/resource/Data" id="link-id0x230b7500">data</a> in at least two copies. There are shared-disk cluster databases like <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0xa9a1d8d8">Oracle</a> RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not well suited to databases because of write speed.</p>

<p>With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.</p>

<p>There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.</p>

<p>In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.</p>

<p>This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application&#39;s requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.</p>

<h2>Basics of Partition Fail-Over</h2>

<p>For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.</p>

<p>The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.</p>
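
<p>The routing rule above can be sketched in a few lines of Python. This is only an illustration, not Virtuoso code: the names <code>Op</code>, <code>route</code>, <code>writes_reenabled</code>, and <code>counter</code> are hypothetical, and round-robin is just one assumed way of balancing committed reads.</p>

<pre>
```python
from enum import Enum


class Op(Enum):
    WRITE = "write"
    LOCKING_READ = "locking_read"      # repeatable or serializable read
    COMMITTED_READ = "committed_read"  # committed state, no repeatability


def route(op, copies, online, writes_reenabled=False, counter=0):
    """Return the copies of one partition that should receive an operation.

    copies is the ordered list of copy names (copies[0] is the first copy),
    online is the set of copies that currently respond.
    """
    live = [c for c in copies if c in online]
    if not live:
        raise RuntimeError("no copy of this partition is online")
    if op is Op.WRITE:
        # Writes go to all copies. While a copy is down they fail, unless
        # writes have been re-enabled on the surviving copies.
        if len(live) != len(copies) and not writes_reenabled:
            raise RuntimeError("a copy is offline and writes are not re-enabled")
        return live
    if op is Op.LOCKING_READ:
        # Locking reads go to the first copy that is still online.
        return [live[0]]
    # Committed reads are balanced round-robin over the online copies.
    return [live[counter % len(live)]]
```
</pre>

<p>For example, <code>route(Op.COMMITTED_READ, ["a", "b"], {"a", "b"}, counter=1)</code> picks copy <code>b</code>, while a write with copy <code>b</code> offline raises an error until <code>writes_reenabled</code> is set.</p>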

<p>Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.</p>

<p>When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, handling in practice goes through two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled, but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.</p>

<p>For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the surviving copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the surviving copy but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.</p>

<p>If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.</p>

<p>As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.</p>

<h2>Levels of Tolerance and Performance</h2>

<p>The only way to make failures totally invisible is to have everything in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. We can do better by using the spare capacity for more than standby.</p>

<p>Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.</p>

<p>The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.</p>
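
<p>A minimal sketch of this read routing, assuming that every partition covers a contiguous run of 64K IDs starting at a multiple of 64K, as in the example above. The function name <code>read_copy</code> and the <code>online</code> parameter are hypothetical and only serve the illustration.</p>

<pre>
```python
K = 1024
PARTITION_SPAN = 64 * K  # IDs per partition, as in the 64K - 128K example


def read_copy(row_id, n_copies=2, online=None):
    """Pick the copy (0-based) that should serve a read of row_id.

    online is the set of copy numbers that respond; None means all do.
    """
    offset = row_id % PARTITION_SPAN  # position of the ID inside its partition
    # Split the partition's ID range evenly: with two copies, the first 32K
    # IDs map to copy 0 and the second 32K IDs map to copy 1.
    preferred = offset * n_copies // PARTITION_SPAN
    if online is None or preferred in online:
        return preferred
    # The preferred copy is down: any surviving copy holds the full partition,
    # so the read still works, at the price of a degraded working set.
    return min(online)
```
</pre>

<p>Because every copy stores the whole partition, the range split only steers reads. Writes still go to all copies, and a read can fall back to another copy at any time.</p>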

<p>Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory.</p>

<h2>Background Bulk Processing</h2>

<p>When loading data, the system is online in principle, but query response can be quite bad. A large <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x3c0cfb8">RDF</a> load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server&#39;s partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.</p>

<p>But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.</p>

<p>Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.</p>

<p>This technique applies to all data intensive background tasks: calculation of <a href="http://dbpedia.org/resource/Entity" id="link-id0x3b38ac0">entity</a> search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.</p>

<h2>Configurations of Redundancy</h2>

<p>Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, there is at most a 1/10 chance that it will, namely when both earlier failures fell in the same group.</p>
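
<p>The 1/7 and 1/10 figures can be checked with a short calculation. The sketch below assumes that each failure hits one of the surviving servers uniformly at random and that all earlier failures fell in the same group, which is the worst case; the function name is hypothetical.</p>

<pre>
```python
from fractions import Fraction


def stop_chance_on_next_failure(total_servers, group_size, failed_in_group):
    """Chance that the next random failure takes a whole group offline.

    failed_in_group is the number of earlier failures, all assumed to have
    fallen in one and the same group.
    """
    survivors = total_servers - failed_in_group
    left_in_group = group_size - failed_in_group
    if left_in_group != 1:
        # The group still has more than one member, so one more failure
        # cannot take it offline.
        return Fraction(0)
    return Fraction(1, survivors)


pairs = stop_chance_on_next_failure(8, 2, 1)     # second failure, groups of two
triples = stop_chance_on_next_failure(12, 3, 2)  # third failure, groups of three
# Chance that the second failure lands in the group of the first one (2 of
# the 11 survivors), times the worst-case chance for the third failure.
triples_overall = Fraction(2, 11) * triples
```
</pre>

<p>Here <code>pairs</code> is 1/7 and <code>triples</code> is 1/10. Under the same uniform-failure assumption, <code>triples_overall</code> comes to 1/55, since the first two failures share a group only 2 times out of 11.</p>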

<p>We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers&#39; RAM when in normal operation.</p>

<p>There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon&#39;s Dynamo does something similar at the database level. The analogies are not exact, though.</p>

<p>If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, there is a probability that the partitions formerly hosted by the failed server have their second copies randomly spread over the remaining servers. This scheme equalizes load better but is less resilient.</p>
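
<p>To see why the second failure probably breaks operations, here is a rough estimate. It assumes, purely for illustration, that the two copies of each slice are placed on a uniformly random pair of distinct servers, independently of the other slices; the function name is hypothetical.</p>

<pre>
```python
from math import comb


def second_failure_breaks(servers, slices=1024):
    """Chance that two failed servers held both copies of some slice."""
    pairs = comb(servers, 2)  # number of possible server pairs
    # A given slice sits on exactly the two failed servers with chance
    # 1 / pairs; operations break if this holds for at least one slice.
    return 1 - (1 - 1 / pairs) ** slices
```
</pre>

<p>With 8 servers there are only 28 possible pairs for 1024 slices, so the estimate is practically 1. The chance only drops noticeably once the number of server pairs becomes comparable to the number of slices.</p>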

<h2>Maintenance and Continuity</h2>

<p>Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.</p>

<h2>Present Status</h2>

<p>The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-03-25#1537">
  <rss:title>Beyond Applications - Introducing the Planetary Datasphere (Part 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-25T15:50:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware. Application Hosting For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0. Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon. What of Proprietary Data and its Security? Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes. What of Individual Privacy on the Open Web? The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. 
In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row. I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed. Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document. It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer. 
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user. The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway. Federation vs. Centralization On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this. A search engine can host a copy of the Document Web and allow text lookups on it. 
But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription. The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data. Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs. What is the Cost of Schema-Last? The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies. We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both. While anything can be represented, the masses are not that creative. 
For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course. Historically, our industry has been driven by two phenomena: New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security. Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database. To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad. Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air. DataSphere Precursors Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences. We have a totally ubiquitous network for the distribution of this, plus database technology to make this work. We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened. 
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data. Databases and Servers Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere. This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards. The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD&#39;s get a bit cheaper, one could have SSD for the database and disk for backup. With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over. For operating such systems, scaling-on-demand is necessary. 
Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these matters are complex and easily become intractable without an in-depth understanding of the field. The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics. Conclusions and Next Steps The ingredients and use cases are there. If server clusters with 1 TB of RAM begin under $100K, the cost of deployment is small compared to personnel costs. Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCyc, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and e-science, the use cases are liberating social networks, C2C, and some aspects of search from silos; overcoming spam; and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been. The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges, however, are not at all what is discussed at Semantic Web conferences. 
These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database matters, such as automatic provisioning. It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software. Related Beyond Applications - Introducing the Planetary Datasphere (Part 1) Serendipitous Discovery Quotient (SDQ) How Linked Data will change Advertising The Time for RDBMS Primacy Downgrade is Nigh! Data Spaces</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id155e3bd0">We have looked at the general implications of the DataSphere</a>, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.</p>

<h2>Application Hosting</h2>

<p>For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.</p>

<p>Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.</p>

<h2>What of Proprietary <a href="http://dbpedia.org/resource/Data" id="link-id0x13b5b4a0">Data</a> and its Security?</h2>

<p>Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed-circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x24db80e0">RDBMS</a> security, the best approach is to map the RDBMS to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x24ea7c40">RDF</a> on the fly, always running all requests through the RDBMS. This implicitly preserves any policy- or label-based security schemes.</p>

<h2>What of Individual Privacy on the Open Web?</h2>

<p>The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with <code>https://</code> URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In <a href="http://dbpedia.org/resource/SQL" id="link-id0x24aaccc0">SQL</a>, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.</p>
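
<p>As a rough illustration of the policy idea, the sketch below uses Python with the standard <code>sqlite3</code> module. The <code>payroll</code> table, its columns, and the <code>run_with_policy</code> helper are hypothetical names made up for this example; a real DBMS would add the predicate inside its query processor rather than in application code.</p>

```python
import sqlite3

def run_with_policy(conn, user_dept_id):
    # The caller asks for "all payroll rows"; the policy layer adds the
    # department predicate before the query reaches the table.
    base_query = "SELECT employee, salary FROM payroll"
    policy_predicate = " WHERE dept_id = ?"
    sql = base_query + policy_predicate + " ORDER BY employee"
    return conn.execute(sql, (user_dept_id,)).fetchall()

# Hypothetical sample data: two departments, three employees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (employee TEXT, dept_id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?, ?)",
    [("alice", 1, 5000), ("bob", 2, 4000), ("carol", 1, 4500)],
)

# A user from department 1 only ever sees department 1 rows.
rows_for_dept_1 = run_with_policy(conn, 1)
print(rows_for_dept_1)
```

<p>The point of the sketch is only that the user&#39;s identity, not the query text, decides which rows are visible.</p>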

<p>I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.</p>

<p>Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data <a href="http://en.wikipedia.org/wiki/Data_Spaces" id="link-id0x2396c058">Space</a> can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, things are less likely to slip by; a finer-grained rule, by contrast, can miss some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.</p>

<p>It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer.</p>

<p>The difference between the Document Web and the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x238a0098">Linked Data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x23882280">Web</a> is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.</p>

<p>The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed <a href="http://dbpedia.org/resource/Information" id="link-id0x2388e458">information</a> using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.</p>
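
<p>A minimal sketch of graph-level filtering follows, in Python. It is not how any particular DBMS implements this; the quad layout, the <code>grants</code> table, and all function names are assumptions for illustration. It shows the point made above: the filter applies to every pattern match, so a closed graph cannot contribute even to an intermediate result.</p>

```python
from collections import namedtuple

# Hypothetical in-memory store: each statement carries the graph it belongs to.
Quad = namedtuple("Quad", "graph subject predicate obj")

def readable_graphs(user, grants):
    # grants maps a graph name to the set of users allowed to read it.
    return {g for g, users in grants.items() if user in users}

def match(quads, allowed, predicate):
    # Every pattern match is restricted to the allowed graphs.
    return [q for q in quads if q.graph in allowed and q.predicate == predicate]

def friends_of_friends(quads, user, grants, start):
    allowed = readable_graphs(user, grants)
    knows = match(quads, allowed, "knows")
    # The first hop is an intermediate result that is never shown to the user.
    first_hop = {q.obj for q in knows if q.subject == start}
    return sorted({q.obj for q in knows if q.subject in first_hop} - {start})

quads = [
    Quad("g:alice-public", "alice", "knows", "bob"),
    Quad("g:bob-private", "bob", "knows", "carol"),
    Quad("g:bob-public", "bob", "knows", "dave"),
]
grants = {
    "g:alice-public": {"alice", "eve"},
    "g:bob-public": {"alice", "eve"},
    "g:bob-private": {"alice"},
}

print(friends_of_friends(quads, "eve", grants, "alice"))
print(friends_of_friends(quads, "alice", grants, "alice"))
```

<p>Because the grant is on the graph, the policy needs no knowledge of which predicates or classes appear inside it.</p>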

<h2>Federation vs. Centralization</h2>

<p>On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds of billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more or less float on top of all of this.</p>

<p>A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.</p>

<p>The cost of hosting a trillion triples will vary radically as a function of the throughput promised. This may result in pricing per service level, a bit like ISP pricing varies as a function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five million joins per second including infrequently-accessed data.</p>

<p>Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs.</p>

<h2>What is the Cost of Schema-Last?</h2>

<p>The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.</p>

<p>We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task-specific representations will keep an edge. We ought to know, as we do both.</p>

<p>While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.</p>

<p>Historically, our industry has been driven by two phenomena:</p>

<ol>
<li>
  <b>New PCs every 2 years.</b> To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.</li>

<li>
  <b>Agility, or <i>ad hoc</i> over planned.</b> The reason the RDBMS won over <a href="http://dbpedia.org/resource/CODASYL" id="link-id0x13b23460">CODASYL</a> network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.</li>
</ol>

<p>To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.</p>

<p>Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air.</p>

<h2>DataSphere Precursors</h2>

<p>Right now, we have the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x236a9be8">Linked Open Data</a> movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x24db8350">DBpedia</a>, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.</p>

<p>We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.</p>

<p>We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.</p>

<p>Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.</p>

<h2>Databases and Servers</h2>

<p>Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data-intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on, but this we have explained elsewhere.</p>

<p>This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8 GB of RAM, a dual-core AMD x86-64 CPU, and 4 disks may cost about $700. A machine with two 4-core Xeons, 16 GB, and 8 disks may be $4000, counting just the components. In our experience, about 32 GB per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low-end boards.</p>
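
<p>The arithmetic behind these figures can be checked with a few lines of Python. The constants come from the paragraph above; the function name is made up for the example, 1 TB is taken as 1024 GB, and racks, switches, and power are left out, so this is only a component-cost estimate.</p>

```python
import math

GB_PER_BILLION_TRIPLES = 32   # minimum quoted in the text
BOARD_RAM_GB = 8              # low-end board from the text
BOARD_COST_USD = 700          # board, CPU, and 4 disks

def cluster_estimate(target_ram_gb):
    # Number of low-end boards needed to reach the target amount of RAM.
    boards = math.ceil(target_ram_gb / BOARD_RAM_GB)
    return {
        "boards": boards,
        "component_cost_usd": boards * BOARD_COST_USD,
        "billion_triples": target_ram_gb / GB_PER_BILLION_TRIPLES,
    }

estimate = cluster_estimate(1024)  # 1 TB taken as 1024 GB
print(estimate)
```

<p>With these inputs, 1 TB works out to 128 boards at $89,600 in components, holding about 32 billion triples at the stated minimum.</p>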

<p>The workload is all about large joins across partitions. The queries parallelize well, so using the largest and most expensive machines as building blocks is not cost-efficient. Having absolutely everything in RAM is also not cost-efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSD for the database and disk for backup.</p>

<p>With large data centers, redundancy becomes an issue. The most cost-effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.</p>

<p>For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these matters are complex and easily become intractable without an in-depth understanding of the field.</p>
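
<p>One way to picture the idea of keeping hot partitions in more copies is the Python sketch below. The thresholds, names, and round-robin placement are all assumptions chosen for illustration; a production DBMS would also weigh disk space, network topology, and the cost of moving data.</p>

```python
def replica_plan(access_counts, base_copies=2, extra_copies=1, hot_share=0.25):
    # A partition gets extra copies when it draws at least hot_share of
    # all accesses; the thresholds are illustrative, not tuned.
    total = sum(access_counts.values())
    plan = {}
    for partition, count in access_counts.items():
        is_hot = total > 0 and count / total >= hot_share
        plan[partition] = base_copies + (extra_copies if is_hot else 0)
    return plan

def place(plan, servers):
    # Round-robin placement. Copies of one partition land on distinct
    # servers as long as no partition has more copies than there are servers.
    placement, cursor = {}, 0
    for partition, copies in sorted(plan.items()):
        placement[partition] = [
            servers[(cursor + i) % len(servers)] for i in range(copies)
        ]
        cursor += copies
    return placement

# Hypothetical access statistics for three partitions.
access_counts = {"p0": 700, "p1": 200, "p2": 100}
plan = replica_plan(access_counts)
placement = place(plan, ["s0", "s1", "s2", "s3"])
print(plan)
print(placement)
```

<p>Re-running the plan as access statistics change, and moving only the partitions whose copy count changed, is the kind of task the DBMS would have to automate.</p>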

<p>The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.</p>

<h2>Conclusions and Next Steps</h2>

<p>The ingredients and use cases are there. If server clusters with 1 TB of RAM begin under $100K, the cost of deployment is small compared to personnel costs.</p>

<p>Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, <a href="http://dbpedia.org/resource/Cyc" id="link-id0x2396a038">OpenCyc</a>, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and e-science, the use cases are liberating social networks, C2C, and some aspects of search from silos; overcoming spam; and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.</p>

<p>The Linked Data Web has its origins in <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x13ea7110">Semantic Web</a> research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this <a href="http://dbpedia.org/resource/Blog" id="link-id0x2388e368">blog</a>, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges, however, are not at all what is discussed at Semantic Web conferences. These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database matters, such as automatic provisioning.</p>

<p>It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: <i><b>go for it</b></i>. Stay tuned for more announcements, as we near production with our next generation of software.</p>


<h2>Related</h2>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id14e02bb0">Beyond Applications - Introducing the Planetary Datasphere (Part 1)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id117dc518">Serendipitous Discovery Quotient (SDQ)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1534" id="link-id15c52410">How Linked Data will change Advertising</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id11e93658">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace" id="link-id1491a588">Data Spaces</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-03-24#1535">
  <rss:title>Beyond Applications - Introducing the Planetary Datasphere (Part 1)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-24T14:38:57Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets. The term DataSphere comes from Dan Simmons&#39; Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these. This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet. I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person&#39;s or organization&#39;s contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it. General There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports. 
These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this. The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made. In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary. The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network: horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services. The place of the DBMS in the stack changes The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems. For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. 
It can&#39;t just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google&#39;s web page cache can&#39;t be relied on as a permanent repository of web content. Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse. End User Benefits For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user&#39;s web presence is his/her blog or web site, and their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user&#39;s Data Space is the collection of all these things, now presented in a queriable form. The user&#39;s Data Space is the user&#39;s statement of presence, referencing the diverse contributions of the user on diverse sites. The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user&#39;s interaction with the personal data space is through applications, like now. 
But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes. Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of &quot;meta-applications,&quot; i.e., unified interfaces for managing many related applications/tasks. Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall. Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of &quot;mashing up&quot; web service interfaces with procedural code, but of &quot;meshing&quot; data through declarative queries across application spaces. Applications in the DataSphere The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. 
Web 2.0 models fit right in. Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though. For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., DBpedia Vs. Freebase identifiers). The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data. The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying. Advertising The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. 
If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out. Content Providers, License, and Attribution For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else. Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI. One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis. Next we will look at the service provider and infrastructure side of this. Related Serendipitous Discovery Quotient (SDQ) How Linked Data will change Advertising The Time for RDBMS Primacy Downgrade is Nigh! Data Spaces</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This is the first in a short series of <a href="http://dbpedia.org/resource/Blog" id="link-id0x12c91d60">blog</a> posts about what becomes possible when essentially unlimited <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2375f488">linked data</a> can be deployed on the open web and private intranets.</p>

<p>The term <i>DataSphere</i> comes from Dan Simmons&#39; <i><a href="http://dbpedia.org/resource/Hyperion_Cantos" id="link-id12ad4718">Hyperion</a></i> science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.</p>

<p>This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.</p>

<p>I use the word <i>DataSphere</i> to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many <a href="http://dbpedia.org/resource/Data" id="link-id0x2504fff8">Data</a> Spaces, or Named Data Spaces. A <i>Data <a href="http://en.wikipedia.org/wiki/Data_Spaces" id="link-id0x81175fa0">Space</a></i> is essentially a person&#39;s or organization&#39;s contribution to the DataSphere. I use <i>Linked Data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x70f4e190">Web</a></i> to refer to component technologies and practices such as <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x3a5ddcd8">RDF</a>, <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x23b049e0">SPARQL</a>, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack <i>per se</i>, but this stack is still the best bet for it.</p>

<h2>General</h2>

<p>There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.</p>

<p>These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.</p>

<p>The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.</p>

<p>In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.</p>

<p>The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network: horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.</p>

<h2>The place of the DBMS in the stack changes</h2>

<p>The <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x10082590">RDBMS</a> in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.</p>

<p>For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can&#39;t just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google&#39;s web page cache can&#39;t be relied on as a permanent repository of web content.</p>

<p>Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language <a href="http://dbpedia.org/resource/Meme" id="link-id0x23c85e68">meme</a>, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.</p>

<h2>End User Benefits</h2>

<p>For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user&#39;s web presence is their blog or web site, plus their contributions to diverse wikis, social web sites, and so forth. These are scattered. The user&#39;s Data Space is the collection of all these things, now presented in a queriable form. It is the user&#39;s statement of presence, referencing the user&#39;s contributions on diverse sites.</p>

<p>The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain specific silos. The end user&#39;s interaction with the personal data space is through applications, like now. But these applications are just wrappers on top of self describing data, represented in domain specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. The user is the master of their personal Data Space, free to take it where he or she wishes.</p>

<p>Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of &quot;meta-applications,&quot; i.e., unified interfaces for managing many related applications/tasks.</p>

<p>Of course, there is the increase in semantic richness, such as better contextuality derived from <a href="http://dbpedia.org/resource/Entity" id="link-id0x23904698">entity</a> extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real world entities, which makes extracts of different sources by different parties potentially joinable. The user interaction will hardly ever be with the raw data. But the raw data being still at hand makes for better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.</p>
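
<p>To make the point about shared identifiers concrete, here is a minimal Python sketch of two independently produced extracts becoming joinable when they use the same URI. The source names, the property names, and the function <code>join_on_identifier</code> are hypothetical and only serve the illustration.</p>

<blockquote>
<pre>from collections import defaultdict

def join_on_identifier(*sources):
    """Merge (subject, property, value) statements from several sources by subject."""
    merged = defaultdict(dict)
    for name, statements in sources:
        for s, p, v in statements:
            # Remember which source made each statement.
            merged[s].setdefault(p, []).append((v, name))
    return dict(merged)

# Hypothetical extracts made by two independent parties.
news = ("news-site", [
    ("http://dbpedia.org/resource/Helsinki", "mentionedIn", "article-17"),
])
shop = ("travel-shop", [
    ("http://dbpedia.org/resource/Helsinki", "offer", "weekend-package"),
    ("http://example.org/id/helsinki", "offer", "city-tour"),
])

for subject, props in join_on_identifier(news, shop).items():
    print(subject, sorted(props))
</pre></blockquote>

<p>The statements that use the shared DBpedia URI end up under one subject, while the statement made with a private identifier stays apart until somebody supplies a mapping.</p>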

<p>
<a href="http://myopenlink.net/dataspace/person/kidehen#this" id="link-id0x37342a60">Kingsley Idehen</a> has coined the term <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id0x3a56e4e8">SDQ</a>, for <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id0x23649b70">Serendipitous Discovery Quotient</a>, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of &quot;mashing up&quot; web service interfaces with procedural code, but of &quot;meshing&quot; data through declarative queries across application spaces.</p>

<h2>Applications in the DataSphere</h2>

<p>The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.</p>

<p>Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.</p>

<p>For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, since pretty much anything will in practice have many names (e.g., <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x2364eae8">DBpedia</a> vs. Freebase identifiers)? The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.</p>

<p>The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks and resolution of synonyms, queries on what graphs mention a certain identifier and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.</p>
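
<p>As a rough sketch of the kind of lookup meant here, the following Python fragment counts the graphs that mention each of several candidate identifiers and picks the most widely used one. It is not how our entity ranks are computed; the quad layout, the function names, and the sample identifiers are assumptions made for the example.</p>

<blockquote>
<pre>def graphs_mentioning(quads, candidates):
    """Map each candidate identifier to the set of graphs that mention it."""
    mentions = {c: set() for c in candidates}
    for g, s, p, o in quads:
        for term in (s, o):
            if term in mentions:
                mentions[term].add(g)
    return mentions

def most_used(quads, candidates):
    """Return the candidate mentioned in the most graphs (on a tie, the first listed)."""
    mentions = graphs_mentioning(quads, candidates)
    return max(candidates, key=lambda c: len(mentions[c]))

# Hypothetical quads: (graph, subject, predicate, object).
quads = [
    ("g1", "http://dbpedia.org/resource/Berlin", "rdfs:label", "Berlin"),
    ("g2", "_:trip", "ex:destination", "http://dbpedia.org/resource/Berlin"),
    ("g3", "http://example.org/other/berlin", "rdfs:label", "Berlin"),
]
candidates = [
    "http://dbpedia.org/resource/Berlin",
    "http://example.org/other/berlin",
]
print(most_used(quads, candidates))
</pre></blockquote>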

<h2>Advertising</h2>

<p>The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, it is not a matter of static keywords but of the semantically explicit data behind each individual user impression driving the ads. Data itself carries no ads but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.</p>

<h2>Content Providers, License, and Attribution</h2>

<p>For the content provider, the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0xa9abc2f8">URI</a> is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.</p>

<p>Intellectual property considerations will make the URI a first class citizen. Just like the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just like Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.</p>

<p>One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.</p>

<p>Next we will look at the service provider and infrastructure side of this.</p>

<h2>Related</h2>
<ul>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id148ea4e0">Serendipitous Discovery Quotient (SDQ)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1534" id="link-id14b07f88">How Linked Data will change Advertising</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id117c6608">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace" id="link-id154e1d58">Data Spaces</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498">
  <rss:title>&quot;E Pluribus Unum&quot;, or &quot;Inversely Functional Identity&quot;, or &quot;Smooshing Without the Stickiness&quot; (re-updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-16T14:14:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable. This is complex, so I will begin with the point and the interested may read on for the details and implications. Starting with soon to be released version 6, Virtuoso allows you to say that two things, if they share a uniquely identifying property, are the same. Examples of uniquely identifying properties would be a book&#39;s ISBN number, or a person&#39;s social security plus full name. In relational language this is a unique key, and in RDF parlance, an inverse functional property. In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut. In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set. On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems. To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. 
This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness. Our ontology is defined like this: -- Populate a named graph with the triples you want to use in query time inferencing ttlp ( &#39; @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; . @prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; . foaf:mbox_sha1sum a owl:InverseFunctionalProperty . foaf:name a owl:InverseFunctionalProperty . &#39;, &#39;xx&#39;, &#39;b3sifp&#39; ); -- Declare that the graph contains an ontology for use in query time inferencing rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;, &#39;b3sifp&#39; ); Then use it: sparql DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; SELECT DISTINCT ?k ?f1 ?f2 WHERE { ?k foaf:name ?n . ?n bif:contains &quot;&#39;Kjetil Kjernsmo&#39;&quot; . ?k foaf:knows ?f1 . 
?f1 foaf:knows ?f2 }; VARCHAR VARCHAR VARCHAR ______________________________________ _______________________________________________ ______________________________ http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin .... Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same. Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both. If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT. 
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances. The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others. The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty. Enough talk. Time for an experiment. 
We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them: Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person. Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think. For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant. The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse. We count the triples that enter into the smoosh: -- the name is an existence because else we&#39;d get several times more due to -- the names occurring in many graphs sparql SELECT COUNT(*) WHERE { { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } . FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) . ?person ?p ?o }; -- We get 3284674 We make a few tables for intermediate results. 
-- For each distinct name, gather the properties and objects from -- all subjects with this name CREATE TABLE name_prop ( np_name ANY, np_p IRI_ID_8, np_o ANY, PRIMARY KEY ( np_name, np_p, np_o ) ); ALTER INDEX name_prop ON name_prop PARTITION ( np_name VARCHAR (-1, 0hexffff) ); -- Map from name to canonical IRI used for the name CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY, ni_s IRI_ID_8 ); ALTER INDEX name_iri ON name_iri PARTITION ( ni_name VARCHAR (-1, 0hexffff) ); -- Map from person IRI to canonical person IRI CREATE TABLE pref_iri ( i IRI_ID_8, pref IRI_ID_8, PRIMARY KEY ( i ) ); ALTER INDEX pref_iri ON pref_iri PARTITION ( i INT (0hexffff00) ); -- a table for the materialization where all aliases get all properties of every other CREATE TABLE smoosh_ct ( s IRI_ID_8, p IRI_ID_8, o ANY, PRIMARY KEY ( s, p, o ) ); ALTER INDEX smoosh_ct ON smoosh_ct PARTITION ( s INT (0hexffff00) ); -- disable transaction log and enable row auto-commit. This is necessary, otherwise -- bulk operations are done transactionally and they will run out of rollback space. LOG_ENABLE (2); -- Gather all the properties of all persons with a name under that name. -- INSERT SOFT means that duplicates are ignored INSERT SOFT name_prop SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; FROM ( sparql DEFINE output:valmode &quot;LONG&quot; SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . 
?x ?p ?o } ) xx ; -- Now choose for each name the canonical IRI INSERT INTO name_iri SELECT np_name, ( SELECT MIN (s) FROM rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ) AS mini FROM name_prop WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- For each person IRI, map to the canonical IRI of that person INSERT SOFT pref_iri (i, pref) SELECT s, ni_s FROM name_iri, rdf_quad WHERE o = ni_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- Make a graph where all persons have one iri with all the properties of all aliases -- and where person-to-person refs are canonicalized INSERT SOFT rdf_quad (g,s,p,o) SELECT IRI_TO_ID (&#39;psmoosh&#39;), ni_s, np_p, COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o ) FROM name_prop, name_iri WHERE ni_name = np_name OPTION ( loop, quietcast ) ; -- A little explanation: The properties of names are copied into rdf_quad with the name -- replaced with its canonical IRI. If the object has a canonical IRI, this is used as -- the object, else the object is unmodified. This is the COALESCE with the sub-query. -- This takes a little time. To check on the progress, take another connection to the -- server and do STATUS (&#39;cluster&#39;); -- It will return something like -- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i -- buffers 549481 253929 d 8 w 0 pfs -- Now finalize the state; this makes it permanent. Else the work will be lost on server -- failure, since there was no transaction log CL_EXEC (&#39;checkpoint&#39;); -- See what we got sparql SELECT COUNT (*) FROM &lt;psmoosh&gt; WHERE {?s ?p ?o}; -- This is 2253102 -- Now make the copy where all have the properties of all synonyms. This takes so much -- space we do not insert it as RDF quads, but make a special table for it so that we can -- run some statistics. This saves time. 
INSERT SOFT smoosh_ct (s, p, o) SELECT s, np_p, np_o FROM name_prop, rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- as above, INSERT SOFT so as to ignore duplicates SELECT COUNT (*) FROM smoosh_ct; -- This is 167360324 -- Find out where the bloat comes from SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC; The results are: 54728777 http://www.w3.org/2002/07/owl#sameAs 48543153 http://xmlns.com/foaf/0.1/knows 13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso 12268512 http://xmlns.com/foaf/0.1/interest 11415867 http://xmlns.com/foaf/0.1/nick 6683963 http://xmlns.com/foaf/0.1/weblog 6650093 http://xmlns.com/foaf/0.1/depiction 4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum 4129629 http://xmlns.com/foaf/0.1/homepage 1776555 http://xmlns.com/foaf/0.1/holdsAccount 1219525 http://xmlns.com/foaf/0.1/based_near 305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 274965 http://xmlns.com/foaf/0.1/name 155131 http://xmlns.com/foaf/0.1/dateOfBirth 153001 http://xmlns.com/foaf/0.1/img 111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR 52930 http://xmlns.com/foaf/0.1/gender 48517 http://www.w3.org/2004/02/skos/core#subject 45697 http://www.w3.org/2000/01/rdf-schema#label 44860 http://purl.org/vocab/bio/0.1/olb Now compare with the predicate distribution of the smoosh with identities canonicalized sparql SELECT COUNT (*) ?p FROM &lt;psmoosh&gt; WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY 1 DESC LIMIT 20; Results are: 748311 http://xmlns.com/foaf/0.1/knows 548391 http://xmlns.com/foaf/0.1/interest 140531 http://www.w3.org/2000/01/rdf-schema#seeAlso 105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 78497 http://xmlns.com/foaf/0.1/name 48099 http://www.w3.org/2004/02/skos/core#subject 45179 http://xmlns.com/foaf/0.1/depiction 40229 http://www.w3.org/2000/01/rdf-schema#comment 38272 http://www.w3.org/2000/01/rdf-schema#label 37378 http://xmlns.com/foaf/0.1/nick 37186 http://dbpedia.org/property/abstract 
34003 http://xmlns.com/foaf/0.1/img 26182 http://xmlns.com/foaf/0.1/homepage 23795 http://www.w3.org/2002/07/owl#sameAs 17651 http://xmlns.com/foaf/0.1/mbox_sha1sum 17430 http://xmlns.com/foaf/0.1/dateOfBirth 15586 http://xmlns.com/foaf/0.1/page 12869 http://dbpedia.org/property/reference 12497 http://xmlns.com/foaf/0.1/weblog 12329 http://blogs.yandex.ru/schema/foaf/school We can drop the owl:sameAs triples from the count, so the bloat is a bit less by that but it still is tens of times larger than the canonicalized copy or the initial state. Now, when we try using the psmoosh graph, we still get different results from the results with the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. These exist: sparql SELECT COUNT (*) WHERE { ?s foaf:knows ?thing . FILTER ( !bif:exists ( SELECT (1) WHERE { ?thing foaf:name ?nn } ) ) }; -- 1393940 So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here. Finally, we calculate the smoosh blow up factors. We do not include owl:sameAs triples in the counts. select (167360324 - 54728777) / 3284674.0; 34.290022997716059 select 2229307 / 3284674.0; = 0.678699621332284 So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 34 if every synonym gets all the properties, or by 0.68 if synonyms are collapsed into one canonical URI. Making the smooshes does not take very long, some minutes for the small one. Inserting the big one would be longer, a couple of hours maybe. It was 33 minutes for filling the smoosh_ct table. The metrics were not with optimal tuning so the performance numbers just serve to show that smooshing takes time. Probably more time than allowable in an interactive situation, no matter how the process is optimized.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>What a terrible word, smooshing...  I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other.  This smooshes them together, makes them interchangeable.</p>

<p>This is complex, so I will begin with the point and the interested may read on for the details and implications.  Starting with the soon-to-be-released version 6, <a href="http://virtuoso.openlinksw.com" id="link-id15718cb8">Virtuoso</a> allows you to say that two things, if they share a uniquely identifying property, are the same.  Examples of uniquely identifying properties would be a book&#39;s ISBN, or a person&#39;s social security number plus full name.  In relational language this is a <i>unique key</i>, and in <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id145ed998">RDF</a> parlance, an <i>inverse functional property</i>.</p>

<p>In most systems, such problems are dealt with as a preprocessing step before querying.  For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules.  This is good if the rules are clear and understood.  This is so in closed situations, where things tend to have standard identifiers to begin with.  But on the open web this is not so clear cut.</p>

<p>In this post, we show how to do these things <i>ad hoc</i>, without materializing anything.  At the end, we also show how to materialize identity and what the consequences of this are with open web <a href="http://dbpedia.org/resource/Data" id="link-id11726358">data</a>.  We use real live web crawls from the <a href="http://challenge.semanticweb.org/" id="link-id14f40448">Billion Triples Challenge</a> data set.</p>

<p>On the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id156e2b10">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id1106ce08">web</a>, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated.  But this is only the beginning of the problems.</p>

<p>To address these, we have added the option of specifying that some property will be considered inversely functional in a query.  This is done at run time and the property does not really have to be inversely functional in the pure sense.  <code>foaf:name</code> will do for an example.  This simply means that for purposes of the query concerned, two subjects which have at least one <code>foaf:name</code> in common are considered the same. In this way, we can join between FOAF files.  With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness.</p>
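
<p>To show what this notion of sameness amounts to, here is a small Python sketch that groups subjects sharing at least one value of a property treated as inverse functional. It is an illustration only and says nothing about how Virtuoso evaluates this internally; the function name <code>same_as_classes</code>, the union-find bookkeeping, and the sample triples are assumptions of the example.</p>

<blockquote>
<pre>from collections import defaultdict

def same_as_classes(triples, ifp_predicates):
    """Group subjects that share a value of any predicate in ifp_predicates."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_subject = {}  # (predicate, value) to the first subject seen with it
    for s, p, o in triples:
        find(s)  # register every subject, even one with no identifying value
        if p in ifp_predicates:
            key = (p, o)
            if key in first_subject:
                union(s, first_subject[key])
            else:
                first_subject[key] = s

    classes = defaultdict(set)
    for s in parent:
        classes[find(s)].add(s)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical data: two blank nodes from different FOAF files share a name.
triples = [
    ("_:a1", "foaf:name", "Alice Example"),
    ("_:b7", "foaf:name", "Alice Example"),
    ("_:b7", "foaf:nick", "alice"),
    ("_:c3", "foaf:name", "Bob Example"),
]
print(same_as_classes(triples, {"foaf:name"}))
</pre></blockquote>

<p>Passing a different set of predicates gives a different grouping over the same data, which is the point of deciding sameness per query instead of at load time.</p>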

<p>Our ontology is defined like this:</p>

<blockquote>
<pre>-- Populate a named graph with the triples you want to use in query time inferencing<br />
ttlp ( &#39;
        @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
        @prefix owl:  &lt;http://www.w3.org/2002/07/owl#&gt; .
        foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty  .
        foaf:name          a  owl:InverseFunctionalProperty  .
       &#39;,
       &#39;xx&#39;,
       &#39;b3sifp&#39;
     );<br />
-- Declare that the graph contains an ontology for use in query time inferencing <br />
rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;,
                &#39;b3sifp&#39;
              );
</pre></blockquote>

<p>Then use it:</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot;  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };<br />
VARCHAR                                  VARCHAR                                           VARCHAR
______________________________________   _______________________________________________   ______________________________<br />
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
</pre></blockquote>

<p>Without the inference, we get no matches.  This is because the data in question has one graph per FOAF file, and blank nodes for persons.  No graph references any person outside the ones in the graph.  So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there.  The declaration in the context named <code>b3sifp</code> just means that all things with a matching <code>foaf:name</code> or <code>foaf:mbox_sha1sum</code> are the same.</p>

<p>Sameness means that two individuals count as one for purposes of <code>DISTINCT</code> or <code>GROUP BY</code>, and that each of them has the <code>UNION</code> of all of the properties of both.</p>

<p>If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for <code>DISTINCT</code>.</p>
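
<p>As a rough illustration of this notion of sameness, here is a small Python sketch.  It is not how Virtuoso implements the inference; the blank-node labels, the <code>same_classes</code> and <code>merged_properties</code> helpers, and the toy triples are all made up for the example.  Subjects that share a value of a property declared inverse functional are merged into one class with a union-find structure, and each class then carries the union of its members' properties.</p>

<blockquote>
<pre>
```python
from collections import defaultdict

# Toy triples (subject, predicate, object); the blank-node labels are made up.
TRIPLES = [
    ("_:a1", "foaf:name", "Alice"),
    ("_:a1", "foaf:knows", "_:b1"),
    ("_:a2", "foaf:name", "Alice"),
    ("_:a2", "foaf:nick", "ali"),
    ("_:b1", "foaf:name", "Bob"),
]
# Properties treated as inverse functional for this query only.
IFPS = {"foaf:name", "foaf:mbox_sha1sum"}


def find(parent, x):
    # Follow parent links to the representative of x's class (path halving).
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def same_classes(triples, ifps):
    """Map each subject to a representative; subjects sharing an IFP value merge."""
    parent = {s: s for s, _, _ in triples}
    first_seen = {}
    for s, p, o in triples:
        if p in ifps:
            other = first_seen.setdefault((p, o), s)
            parent[find(parent, s)] = find(parent, other)
    return {s: find(parent, s) for s in parent}


def merged_properties(triples, rep):
    """Union of (predicate, object) pairs per identity class."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[rep[s]].add((p, o))
    return dict(props)


rep = same_classes(TRIPLES, IFPS)
props = merged_properties(TRIPLES, rep)
# Two Alice nodes collapse into one class, so a DISTINCT over persons sees two.
print(len(set(rep.values())))
```
</pre></blockquote>

<p>In the sketch, the two nodes named Alice count once for <code>DISTINCT</code> and share the nick and the knows link.  A naive smoosh would instead copy the properties to both nodes and still count them as two.</p>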

<p>If we have complex application rules for determining whether individuals are the same, then one can materialize <code>owl:sameAs</code> triples and keep them in a separate graph.  In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.</p>

<p>The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up.  Best and worst cases will always depend on the data.  In rough terms, the more <i>ad hoc</i> the use, the less desirable the materialization.  If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best.  We can do that too, but so can others.</p>

<p>The principal point is about agility as concerns the inference.  Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble.  When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal.  Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion <a href="http://dbpedia.org/resource/Blog" id="link-id1156a550">blog</a> posts or something of the sort.  So queries will typically be selective.  Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. </p>

<p>The exception is corpus-wide statistics queries.  But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound.  And when things are disk bound, size matters.  Reading extra entailment on the way is just a performance penalty.</p>

<p>Enough talk. Time for an experiment.  We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:</p>

<ol>
<li>Resolve identity at insert time.  We remove duplicate person URIs, and give the single <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id11317008">URI</a> all the properties of all the duplicate URIs.  We expect these to be most often repeats.  If a person references another person, we normalize this reference to go to the single URI of the referenced person.</li>

<li>Give every duplicate URI of a person all the properties of all the duplicates.  If these are the same value, the data should not get much bigger, or so we think.</li>
</ol>

<p>For the experiment, we will consider two people the same if they have the same <code>foaf:name</code> and are both instances of <code>foaf:Person</code>.  This gets some extra hits, but these should not be statistically significant.</p>
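
<p>Before the real script, the difference between the two materializations can be seen on a toy example.  The following Python sketch is only an illustration under simplifying assumptions: the rows and the <code>smoosh</code> helper are hypothetical, each subject has one name, and references between persons are not canonicalized.  It builds both variants for three aliases that share a name but state different things.</p>

<blockquote>
<pre>
```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) rows; three aliases share one name.
ROWS = [
    ("p1", "foaf:name", "Alice"), ("p1", "foaf:knows", "q1"),
    ("p2", "foaf:name", "Alice"), ("p2", "foaf:knows", "q2"),
    ("p3", "foaf:name", "Alice"), ("p3", "foaf:knows", "q3"),
]


def smoosh(rows):
    """Return (canonical, copied) triple sets for subjects that have a name."""
    name_of = {s: o for s, p, o in rows if p == "foaf:name"}
    aliases = defaultdict(set)      # name to the subjects carrying that name
    name_prop = defaultdict(set)    # name to (predicate, object) pairs
    for s, p, o in rows:
        if s in name_of:
            aliases[name_of[s]].add(s)
            name_prop[name_of[s]].add((p, o))
    # Option 1: one canonical subject per name (the smallest id) gets everything.
    canonical = {(min(aliases[n]), p, o)
                 for n in name_prop for p, o in name_prop[n]}
    # Option 2: every alias receives every property of every other alias.
    copied = {(s, p, o)
              for n in name_prop for s in aliases[n] for p, o in name_prop[n]}
    return canonical, copied


canonical, copied = smoosh(ROWS)
print(len(ROWS), len(canonical), len(copied))  # prints: 6 4 12
```
</pre></blockquote>

<p>With identical statements on every alias the copied variant would not grow, but as soon as the aliases differ, its size is roughly the number of aliases times the number of distinct statements.</p>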

<p>The following is a commented <a href="http://dbpedia.org/resource/SQL" id="link-id110945b0">SQL</a> script performing the smoosh.  We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone.  We use SPARQL where possible for readability.  As the documentation states, <code>iri_to_id</code> converts from the qualified name of an IRI to its ID and <code>id_to_iri</code> does the reverse.</p>

<p>We count the triples that enter into the smoosh:</p>

<blockquote>
<pre>-- The name test is an existence check; otherwise we&#39;d count several times more, 
-- because the names occur in many graphs <br />
sparql 
   SELECT COUNT(*) 
    WHERE { { SELECT DISTINCT ?person 
               WHERE { ?person a foaf:Person }
            } . 
            FILTER ( bif:exists ( SELECT (1) 
                                   WHERE { ?person foaf:name ?nn } 
                                )
                       ) . 
            ?person ?p ?o
          };<br />
-- We get 3284674
</pre></blockquote>

<p>We make a few tables for intermediate results.</p>

<blockquote>
<pre>-- For each distinct name, gather the properties and objects from 
-- all subjects with this name <br />
CREATE TABLE name_prop 
   ( np_name  ANY, 
     np_p     IRI_ID_8, 
     np_o     ANY, 
     PRIMARY KEY ( np_name, 
                   np_p, 
                   np_o
                 )
   );
ALTER INDEX name_prop 
   ON name_prop 
   PARTITION ( np_name VARCHAR (-1, 0hexffff) );<br />
-- Map from name to canonical IRI used for the name <br />
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8
                      );
ALTER INDEX name_iri 
   ON name_iri 
   PARTITION ( ni_name VARCHAR (-1, 0hexffff) );<br />
-- Map from person IRI to canonical person IRI<br />
CREATE TABLE pref_iri 
   ( i     IRI_ID_8, 
     pref  IRI_ID_8, 
     PRIMARY KEY ( i )
   );
ALTER INDEX pref_iri 
   ON pref_iri 
   PARTITION ( i INT (0hexffff00) );<br />
-- a table for the materialization where all aliases get all properties of every other <br />
CREATE TABLE smoosh_ct 
   ( s  IRI_ID_8, 
     p  IRI_ID_8, 
     o  ANY, 
     PRIMARY KEY ( s, 
                   p, 
                   o
                 ) 
   );
ALTER INDEX smoosh_ct 
   ON smoosh_ct 
   PARTITION ( s INT (0hexffff00) );<br />
-- disable transaction log and enable row auto-commit.  This is necessary, otherwise 
-- bulk operations are done transactionally and they will run out of rollback space.<br />
LOG_ENABLE (2);<br />
-- Gather all the properties of all persons with a name under that name.  
-- INSERT SOFT means that duplicates are ignored <br />
INSERT SOFT name_prop 
   SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; 
   FROM ( sparql 
          DEFINE output:valmode &quot;LONG&quot; 
          SELECT ?n ?p ?o 
          WHERE { ?x a foaf:Person . 
                 ?x foaf:name ?n . 
                 ?x ?p ?o
               }
        ) xx ;<br />
-- Now choose for each name the canonical IRI <br />
INSERT INTO name_iri 
   SELECT np_name, 
          ( SELECT MIN (s) 
              FROM rdf_quad 
             WHERE o = np_name 
                   AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;)
          ) AS mini 
     FROM name_prop 
    WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- For each person IRI, map to the canonical IRI of that person <br />
INSERT SOFT pref_iri (i, pref) 
   SELECT s, 
          ni_s 
     FROM name_iri, 
          rdf_quad 
    WHERE o = ni_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- Make a graph where all persons have one iri with all the properties of all aliases 
-- and where person-to-person refs are canonicalized<br />
INSERT SOFT rdf_quad (g,s,p,o) 
   SELECT IRI_TO_ID (&#39;psmoosh&#39;), 
          ni_s, 
          np_p, 
          COALESCE ( ( SELECT pref 
                         FROM pref_iri 
                        WHERE i = np_o
                     ), 
                     np_o 
                   )
     FROM name_prop, 
          name_iri 
    WHERE ni_name = np_name 
   OPTION ( loop, quietcast ) ;<br />
-- A little explanation:  The properties of names are copied into rdf_quad with the name 
-- replaced with its canonical IRI.  If the object has a canonical IRI, this is used as 
-- the object, else the object is unmodified.  This is the COALESCE with the sub-query.<br />
-- This takes a little time.  To check on the progress, take another connection to the 
-- server and do <br />
STATUS (&#39;cluster&#39;);<br />
-- It will return something like 
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s  75% cpu 186%  read 12% clw threads 5r 0w 0i 
-- buffers 549481 253929 d 8 w 0 pfs<br />
-- Now finalize the state; this makes it permanent.  Else the work will be lost on server 
-- failure, since there was no transaction log <br />
CL_EXEC (&#39;checkpoint&#39;);<br />
-- See what we got<br />
sparql 
   SELECT COUNT (*) 
     FROM &lt;psmoosh&gt; 
     WHERE {?s ?p ?o};<br />
-- This is 2253102<br />
-- Now make the copy where all have the properties of all synonyms.  This takes so much 
-- space we do not insert it as RDF quads, but make a special table for it so that we can 
-- run some statistics.  This saves time.<br />
INSERT SOFT smoosh_ct (s, p, o)  
   SELECT s, np_p, np_o 
     FROM name_prop, 
          rdf_quad 
    WHERE o = np_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- as above, INSERT SOFT so as to ignore duplicates <br />
SELECT COUNT (*) 
   FROM smoosh_ct;<br />
-- This is  167360324<br />
-- Find out where the bloat comes from <br />
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (p) 
   FROM smoosh_ct 
   GROUP BY p 
   ORDER BY 1 DESC;
</pre></blockquote>
<p>The results are:</p>

<blockquote>
<pre>54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb
</pre></blockquote>

<p>Now compare with the predicate distribution of the smoosh with identities canonicalized:</p>

<blockquote>
<pre>sparql 
     SELECT COUNT (*) ?p 
       FROM &lt;psmoosh&gt; 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;</pre></blockquote>

<p>Results are:</p>
<blockquote>
<pre>748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school
</pre></blockquote>

<p>We can drop the <code>owl:sameAs</code> triples from the count, which reduces the bloat somewhat, but the result is still tens of times larger than the canonicalized copy or the initial state.</p>

<p>Now, when we try using the psmoosh graph, we still get different results from those with the original data.  This is because <code>foaf:knows</code> relations to things with no <code>foaf:name</code> are not represented in the smoosh.  They exist:</p>

<blockquote>
<pre>sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };<br />
-- 1393940
</pre></blockquote>

<p>So the smoosh graph is not an accurate rendition of the social network.  It would have to be smooshed further to be that, since the data in the sample is quite irregular.  But we do not go that far here.</p>

<p>Finally, we calculate the smoosh blow up factors.  We do not include <code>owl:sameAs</code> triples in the counts.</p>

<blockquote>
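<pre>-- Every alias gets all properties: smoosh_ct rows less its owl:sameAs rows, over the original count <br />
SELECT (167360324 - 54728777) / 3284674.0;
34.290022997716059<br />
-- Canonicalized: 2229307 is the psmoosh count 2253102 less its 23795 owl:sameAs triples <br />
SELECT 2229307 / 3284674.0;
0.678699621332284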
</pre></blockquote>

<p>So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 34 if every synonym is given all the properties of the others, or by 0.68 if synonyms are collapsed into a single URI.</p>

<p>Making the smooshes does not take very long, some minutes for the small one.  Inserting the big one would be longer, a couple of hours maybe.  It was 33 minutes for filling the <code>smoosh_ct</code> table.  The metrics were not with optimal tuning so the performance numbers just serve to show that smooshing takes time.  Probably more time than allowable in an interactive situation, no matter how the process is optimized.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-11#1494">
  <rss:title>Virtuoso Anytime:  No Query Is Too Complex (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-11T16:13:10Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A persistent argument against the linked data web has been the cost, scalability, and vulnerability of SPARQL end points, should the linked data web gain serious mass and traffic. As we are on the brink of hosting the whole DBpedia Linked Open Data cloud in Virtuoso Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set. How can we encourage clever use of data, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries? Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute. Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness. So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done. Here we are looking for the top 10 people whom people claim to know without being known in return, like this: SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/BarackObama 252 http://twitter.com/brianshaler 183 http://twitter.com/newmediajim 101 http://twitter.com/HenryRollins 95 http://twitter.com/wilw 81 http://twitter.com/stevegarfield 78 http://twitter.com/cote 66 mailto:adam.westerski@deri.org 66 mailto:michal.zaremba@deri.org 66 http://twitter.com/dsifry 65 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. 
Activity: 1R rnd 0R seq 0P disk 1.346KB / 3 messages SQL&gt; sparql SELECT ?celeb, COUNT (*) WHERE { ?claimant foaf:knows ?celeb . FILTER (!bif:exists ( SELECT (1) WHERE { ?celeb foaf:knows ?claimant } ) ) } GROUP BY ?celeb ORDER BY DESC 2 LIMIT 10; celeb callret-1 VARCHAR VARCHAR ________________________________________ _________ http://twitter.com/JasonCalacanis 496 http://twitter.com/Twitterrific 466 http://twitter.com/ev 442 http://twitter.com/BarackObama 356 http://twitter.com/laughingsquid 317 http://twitter.com/gruber 294 http://twitter.com/chrispirillo 259 http://twitter.com/ambermacarthur 224 http://twitter.com/t 219 http://twitter.com/johnedwards 188 *** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete results, query interrupted by result timeout. Activity: 329R rnd 44.6KR seq 342P disk 638.4KB / 46 messages The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better. But the response time was the same. If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple. But such queries are not very interesting. To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks. The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the GROUP BY to the ORDER BY. If this again times out, we continue with the next outer layer. This guarantees that results are delivered if there were any results found for which the query pattern is true. False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation. One can also use this as a basis for paid services. 
The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set. This system will be deployed on our Billion Triples Challenge demo instance in a few days, after some more testing. When Virtuoso 6 ships, all LOD Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default. (AMI users will be able to disable the feature, if desired.) The feature works with Virtuoso 6 in both single server and cluster deployment.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>A persistent argument against the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id1199d5f8">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id116f2730">web</a> has been the cost, scalability, and vulnerability of <a href="http://dbpedia.org/resource/SPARQL" id="link-id14e423c0">SPARQL</a> end points, should the linked data web gain serious mass and traffic.</p>

<p>As we are on the brink of hosting the whole <a href="http://dbpedia.org/resource/DBpedia" id="link-id1376a8b0">DBpedia</a> <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id113c8d20">Linked Open Data</a> cloud in <a href="http://virtuoso.openlinksw.com" id="link-id11425a78">Virtuoso</a> Cluster, we have had to think of what we&#39;ll do if, for example, somebody decides to count all the triples in the set.</p>

<p>How can we encourage clever use of <a href="http://dbpedia.org/resource/Data" id="link-id116f1210">data</a>, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?</p>

<p>Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute.  Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness.  So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.</p>

<p>Here we are looking for the top 10 people whom people claim to know without being known in return, like this:</p>

<blockquote>
<pre>SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/BarackObama             252
http://twitter.com/brianshaler             183
http://twitter.com/newmediajim             101
http://twitter.com/HenryRollins            95
http://twitter.com/wilw                    81
http://twitter.com/stevegarfield           78
http://twitter.com/cote                    66
mailto:adam.westerski@deri.org             66
mailto:michal.zaremba@deri.org             66
http://twitter.com/dsifry                  65<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:      1R rnd      0R seq      0P disk  1.346KB /      3 messages<br />
SQL&gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;<br />
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________<br />
http://twitter.com/JasonCalacanis          496
http://twitter.com/Twitterrific            466
http://twitter.com/ev                      442
http://twitter.com/BarackObama             356
http://twitter.com/laughingsquid           317
http://twitter.com/gruber                  294
http://twitter.com/chrispirillo            259
http://twitter.com/ambermacarthur          224
http://twitter.com/t                       219
http://twitter.com/johnedwards             188<br />
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:    329R rnd   44.6KR seq    342P disk  638.4KB /     46 messages</pre></blockquote>

<p>The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better.  But the response time was the same.</p>

<p>If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple.  But such queries are not very interesting.  To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks.  The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the <code>GROUP BY</code> to the <code>ORDER BY</code>.  If this again times out, we continue with the next outer layer.  This guarantees that results are delivered if there were any results found for which the query pattern is true.  False results are not produced, except in cases where there is a comparison with a count and the count is smaller than it would be with the full evaluation.</p>

<p>One can also use this as a basis for paid services.  The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.</p>
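
<p>The idea of cutting off the innermost scan and still feeding what was accumulated to the outer stages can be sketched in a few lines of Python.  This is only an illustration, not the Virtuoso implementation: the <code>top_unreciprocated</code> function and the sample pairs are hypothetical, and the budget is counted in scanned rows rather than time so that the outcome is repeatable.</p>

<blockquote>
<pre>
```python
from collections import Counter
from itertools import islice

# Hypothetical foaf:knows pairs (claimant, celeb).
KNOWS = [
    ("a", "star"), ("b", "star"), ("c", "star"),
    ("star", "a"),
    ("a", "b"), ("c", "b"),
    ("d", "star"),
]


def top_unreciprocated(knows, budget, limit=10):
    """Count unreciprocated claims per celeb, scanning at most `budget` rows.

    Returns (rows, complete); complete is False when the scan was cut off.
    """
    pairs = set(knows)              # stands in for the index used by the exists test
    counts = Counter()
    scanned = 0
    for claimant, celeb in islice(knows, budget):
        scanned += 1
        if (celeb, claimant) not in pairs:
            counts[celeb] += 1
    # The interrupted GROUP BY still hands what it has to ORDER BY and LIMIT.
    rows = counts.most_common(limit)
    return rows, scanned == len(knows)


partial, partial_done = top_unreciprocated(KNOWS, budget=4)
full, full_done = top_unreciprocated(KNOWS, budget=len(KNOWS))
print(partial, partial_done)
print(full, full_done)
```
</pre></blockquote>

<p>Every row in the partial answer is a true match, but its count can be lower than in the full evaluation, which is the caveat about comparisons with counts noted above.</p>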

<p>This system will be deployed on our <a href="http://challenge.semanticweb.org/" id="link-id11500a58">Billion Triples Challenge</a> <a href="http://b3s.openlinksw.com/" id="link-id11683120">demo instance</a> in a few days, after some more testing.  When Virtuoso 6 ships, all <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id1157a500">LOD</a> Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default.  (AMI users will be able to disable the feature, if desired.)  The feature works with Virtuoso 6 in both single server and cluster deployment.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-27#1487">
  <rss:title>An Example of RDF Scalability</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-27T11:23:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We hear it to exhaustion, where is RDF scalability? We have been suggesting for a while that this is a solved question. I will here give some concrete numbers to back this. The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task. Below we show how we do 3.3 million random triple lookups per second on two 8 core commodity servers producing complete results, joining across partitions. On a single 4 core server, the figure is about 1 million lookups per second. With a single thread, it is about 250K lookups per second. This is the good case. But even our worse case is quite decent. We took a simple SPARQL query, counting how many people say they reciprocally know each other. In the Billion Triples Challenge data set, there are 25M foaf:knows quads of which 92K are reciprocal. Reciprocal here means that when x knows y in some graph, y knows x in the same or any other graph. SELECT COUNT (*) WHERE { ?p1 foaf:knows ?p2 . ?p2 foaf:knows ?p1 } There is no guarantee that the triple of x knows y is in the same partition as the triple y knows x. Thus the join is randomly distributed, n partitions to n partitions. We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking. Since then, we have corrected this. If run on a single thread, this query would be a loop over all the quads with a predicate of foaf:knows, and an inner loop looking for a quad with 3 of 4 fields given (SPO). If we have a partitioned situation, we have a loop over all the foaf:knows quads in each partition, and an inner lookup looking for the reciprocal foaf:knows quad in whatever partition it may be found. 
We have implemented this with two different message patterns: Centralized: One process reads all the foaf:knows quads from all processes. Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad. Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end. Symmetrical: Each process reads the foaf:knows quads in its partition, and sends a batch of checks to each process that could have the reciprocal foaf:knows quad every 50K quads. At the end, the counts are gathered from all partitions. There is some additional control traffic but we do not go into its details here. Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 Virtuoso instances. The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache. Centralized: 35,543 msec, 728,634 sequential + random lookups per second Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s 316% cpu ... Symmetrical: 7706 msec, 3,360,740 sequential + random lookups per second Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s 1137% cpu ... The second line is the summary from the cluster status report for the duration of the query. The interesting numbers are the KB/s and the %CPU. The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%. The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate. Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation. These figures represent the best and worst cases of distributed JOINing. 
If we have a straight sequence of JOINs, with single pattern optionals and existences and the order in which results are produced is not significant (i.e., there is aggregation, existence test, or ORDER BY), the symmetrical pattern is applicable. On the other hand, if there are multiple triple pattern optionals, complex sub-queries, DISTINCTs in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time. Also, if we must make transitive closures, which can be thought of as an extension of a DISTINCT in a subquery, we must pass the data through a single point before moving the bindings to the next JOIN in the sequence. This happens for example in resolving owl:sameAs at run time. However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, DISTINCTs, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens-of-thousands of variable bindings into a single message exchange. And if there are not that many intermediate results, then single query execution time is not a problem anyhow. For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the foaf:knows relations and then running the foaf:knows relations through the hash. If the hash size is right, a hash lookup is somewhat better than an index lookup. The problem is that when the hash join is not the right solution, it is an expensive mistake: the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing. One problem of hash joins is that they make temporary data structures which, if large, will skew the working set. One must be quite sure of the cardinality before it is safe to try a hash join. 
So we do not do hash joins with RDF, but we do use them sometimes with relational data. These same methods apply to relational data just as well. This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We hear the question to exhaustion: where is <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1eab4128">RDF</a> scalability?  We have been suggesting for a while that this is a solved question.  I will give some concrete numbers here to back this up.</p>

<p>The scalability dream is to add hardware and get increased performance in proportion to the power the added component has when measured by itself. A corollary dream is to take scalability effects that are measured in a simple task and see them in a complex task.</p>

<p>Below we show how we do 3.3 million random triple lookups per second on two 8 core commodity servers producing complete results, joining across partitions. On a single 4 core server, the figure is about 1 million lookups per second.  With a single thread, it is about 250K lookups per second.  This is the good case.  But even our worst case is quite decent.</p>

<p>We took a simple <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x15cb3da8">SPARQL</a> query, counting how many people say they reciprocally know each other.  In the <a href="http://challenge.semanticweb.org/" id="link-id0x1bfb7a00">Billion Triples Challenge</a> <a href="http://dbpedia.org/resource/Data" id="link-id0xa57187d8">data</a> set, there are 25M <code>foaf:knows</code> quads of which 92K are reciprocal. <i>Reciprocal</i> here means that when x knows y in some graph, y knows x in the same or any other graph.</p>

<pre>SELECT COUNT (*) 
WHERE { 
         ?p1  foaf:knows  ?p2  . 
         ?p2  foaf:knows  ?p1 
      }</pre>

<p>There is no guarantee that the triple <code>x knows y</code> is in the same partition as the triple <code>y knows x</code>.  Thus the join is randomly distributed, n partitions to n partitions.</p>

<p>We left this out of the Billion Triples Challenge demo because this did not run fast enough for our liking.  Since then, we have corrected this.</p>

<p>If run on a single thread, this query would be a loop over all the quads with a predicate of <code>foaf:knows</code>, and an inner loop looking for a quad with 3 of 4 fields given (<code>SPO</code>). If we have a partitioned situation, we have a loop over all the <code>foaf:knows</code> quads in each partition, and an inner lookup looking for the reciprocal <code>foaf:knows</code> quad in whatever partition it may be found.</p>
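
<p>As a rough illustration of the single-thread plan described above, the following Python sketch loops over the <code>foaf:knows</code> quads and probes for the reciprocal quad with S, P, and O given. The in-memory set standing in for the index, the tuple layout, and the function name are assumptions made for the example; this is not Virtuoso code, and it counts each outer quad at most once, which may differ from the exact SPARQL count when the same triple occurs in several graphs.</p>

```python
from typing import Iterable, Tuple

# Hypothetical quad layout: (subject, predicate, object, graph).
Quad = Tuple[str, str, str, str]

FOAF_KNOWS = "foaf:knows"


def count_reciprocal(quads: Iterable[Quad]) -> int:
    """Count foaf:knows quads whose reciprocal exists in any graph."""
    quads = list(quads)
    # Stand-in for an index lookup with S, P, O given: graph is ignored,
    # so the reciprocal may live in the same or any other graph.
    spo_index = {(s, p, o) for s, p, o, _graph in quads}

    count = 0
    for s, p, o, _graph in quads:          # outer loop over all quads
        if p != FOAF_KNOWS:
            continue
        if (o, p, s) in spo_index:         # inner lookup for "o knows s"
            count += 1
    return count


sample = [
    ("a", FOAF_KNOWS, "b", "g1"),
    ("b", FOAF_KNOWS, "a", "g2"),   # reciprocal found in another graph
    ("a", FOAF_KNOWS, "c", "g1"),   # no reciprocal
]
reciprocal = count_reciprocal(sample)
```

<p>In the partitioned case the outer loop stays the same per partition, but the inner lookup may have to go to another process, which is where the message patterns come in.</p>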

<p>We have implemented this with two different message patterns: </p>

<ol>
 <li>
  <p>
    <b>Centralized:</b> One process reads all the <code>foaf:knows</code> quads from all processes.  Every 50K quads, it sends a batch of reciprocal quad checks to each partition that could contain a reciprocal quad.  Each partition keeps the count of found reciprocal quads, and these are gathered and added up at the end.</p>
 </li>

<li>
  <p>
    <b>Symmetrical:</b> Each process reads the <code>foaf:knows</code> quads in its partition, and sends a batch of checks to each process that could have the reciprocal <code>foaf:knows</code> quad every 50K quads.  At the end, the counts are gathered from all partitions.  There is some additional control traffic but we do not go into its details here.</p>
</li>
</ol>
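
<p>The symmetrical pattern can be sketched as a small single-process simulation. Everything here is an assumption made for illustration: partitioning by a CRC of the subject, the function names, and modelling a message as a list handed to the target partition. The control traffic mentioned above is left out.</p>

```python
import zlib
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (subject, object) of one foaf:knows quad


def partition_of(subject: str, n_partitions: int) -> int:
    """Hypothetical partitioning function: CRC32 of the subject."""
    return zlib.crc32(subject.encode("utf-8")) % n_partitions


def symmetrical_count(pairs: Iterable[Pair], n_partitions: int,
                      batch_size: int = 50_000) -> Tuple[int, int]:
    """Return (reciprocal count, number of batch messages sent)."""
    # Each partition holds the pairs whose subject hashes to it.
    stores: List[Set[Pair]] = [set() for _ in range(n_partitions)]
    for s, o in pairs:
        stores[partition_of(s, n_partitions)].add((s, o))

    found = [0] * n_partitions   # each partition keeps its own count
    messages = 0

    for store in stores:         # every process scans only its own quads
        outgoing = defaultdict(list)
        pending = 0
        for s, o in store:
            # The reciprocal (o knows s) lives where o hashes to.
            outgoing[partition_of(o, n_partitions)].append((o, s))
            pending += 1
            if pending == batch_size:
                messages += _send(outgoing, stores, found)
                pending = 0
        messages += _send(outgoing, stores, found)

    return sum(found), messages  # counts are gathered at the end


def _send(outgoing, stores, found) -> int:
    """Deliver one batch per target partition; return messages sent."""
    sent = 0
    for target, batch in outgoing.items():
        if batch:
            found[target] += sum(1 for check in batch if check in stores[target])
            sent += 1
    outgoing.clear()
    return sent


pairs = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c")]
total, message_count = symmetrical_count(pairs, n_partitions=4, batch_size=2)
```

<p>The centralized pattern differs only in who runs the outer loop: one process would read all the quads and send the batches, so all outgoing traffic passes through a single point.</p>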

<p>Below is the result measured on 2 machines each with 2 x Xeon 5345 (quad core; total 8 cores), 16G RAM, and each machine running 6 <a href="http://virtuoso.openlinksw.com" id="link-id0x1c0c94a8">Virtuoso</a> instances.  The interconnect is dual 1-Gbit ethernet. Numbers are with warm cache.</p>

<blockquote>
<code>Centralized:  35,543 msec,  728,634 sequential + random lookups per second <br />
Cluster 12 nodes, 35 s. 1072 m/s 39,085 KB/s  316% cpu ...
 <br /> <br />
Symmetrical:  7706 msec, 3,360,740 sequential + random lookups per second  <br />
Cluster 12 nodes, 7 s. 572 m/s 16,983 KB/s  1137% cpu ...</code>
</blockquote>

<p>The second line is the summary from the cluster status report for the duration of the query.  The interesting numbers are the KB/s and the %CPU.  The former is the cross-sectional data transfer rate for intra-cluster communication; the latter is the consolidated CPU utilization, where a constantly-busy core counts for 100%.  The point to note is that the symmetrical approach takes 4x less real time with under half the data transfer rate.  Further, when using multiple machines, the speed of a single interface does not limit the overall throughput as it does in the centralized situation.</p>
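
<p>The ratios quoted above can be checked directly from the two result lines. The short Python calculation below uses only the reported figures; treating the KB/s value as an average over the whole query duration, so that rate times duration estimates the total volume moved, is an assumption.</p>

```python
# Figures copied from the two result lines above.
CENTRALIZED = {"msec": 35_543, "lookups_per_s": 728_634,
               "kb_per_s": 39_085, "cpu_pct": 316}
SYMMETRICAL = {"msec": 7_706, "lookups_per_s": 3_360_740,
               "kb_per_s": 16_983, "cpu_pct": 1137}


def derived(run: dict) -> dict:
    """Totals implied by a run, assuming the rates are averages."""
    seconds = run["msec"] / 1000.0
    return {
        "total_lookups": run["lookups_per_s"] * seconds,
        "total_kb": run["kb_per_s"] * seconds,
        "busy_cores": run["cpu_pct"] / 100.0,
    }


# Real-time ratio: how many times faster the symmetrical run finishes.
speedup = CENTRALIZED["msec"] / SYMMETRICAL["msec"]

# Transfer-rate ratio: symmetrical KB/s as a fraction of centralized KB/s.
rate_ratio = SYMMETRICAL["kb_per_s"] / CENTRALIZED["kb_per_s"]

# Estimated total data moved, centralized relative to symmetrical.
volume_ratio = derived(CENTRALIZED)["total_kb"] / derived(SYMMETRICAL)["total_kb"]

print(f"speedup      {speedup:.2f}x")
print(f"rate ratio   {rate_ratio:.2f}")
print(f"volume ratio {volume_ratio:.1f}x")
```

<p>Under that assumption both runs come out at roughly the same total number of lookups, about 25.9 million, which is in line with the 25M <code>foaf:knows</code> quads in the data set; the symmetrical run simply spreads that work over more cores and moves less data while doing it.</p>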

<p>These figures represent the best and worst cases of distributed <code>JOIN</code>ing.  If we have a straight sequence of <code>JOIN</code>s, with single pattern optionals and existences and the order in which results are produced is not significant (i.e., there is aggregation, existence test, or <code>ORDER BY</code>), the symmetrical pattern is applicable.  On the other hand, if there are multiple triple pattern optionals, complex sub-queries, <code>DISTINCT</code>s in the middle of the query, or results have to be produced in the order of an index, then the centralized approach must be used at least part of the time.</p>

<p>Also, if we must make transitive closures, which can be thought of as an extension of a <code>DISTINCT</code> in a subquery, we must pass the data through a single point before moving the bindings to the next <code>JOIN</code> in the sequence. This happens for example in resolving <code><a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x28005280">owl</a>:sameAs</code> at run time.  However, the good news is that performance does not fall much below the centralized figure even when there are complex nested structures with intermediate transitive closures, <code>DISTINCT</code>s, complex existence tests, etc., that require passing all intermediate results through a central point. No matter the complexity, it is always possible to vector some tens-of-thousands of variable bindings into a single message exchange.  And if there are not that many intermediate results, then single query execution time is not a problem anyhow.</p>

<p>For our sample query, we would get still more speed by using a partitioned hash join, filling the hash from the <code>foaf:knows</code> relations and then running the <code>foaf:knows</code> relations through the hash.  If the hash size is right, a hash lookup is somewhat better than an index lookup.  The problem is that when the hash join is not the right solution, it is an expensive mistake:  the best case is good; the worst case is very bad. But if there is no index then hash join is better than nothing.  One problem of hash joins is that they make temporary data structures which, if large, will skew the working set.  One must be quite sure of the cardinality before it is safe to try a hash join.  So we do not do hash joins with RDF, but we do use them sometimes with relational data. </p>
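
<p>For comparison, a partitioned hash join for the same query could look roughly like the Python sketch below. This is not something the engine does for RDF, as noted above; the partitioning key (the unordered pair, so that <code>x knows y</code> and <code>y knows x</code> land in the same partition), the use of CRC32, and the function names are all assumptions made for the example.</p>

```python
import zlib
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (subject, object) of one foaf:knows quad


def pair_partition(s: str, o: str, n_partitions: int) -> int:
    """Hypothetical partitioning on the unordered pair {s, o}."""
    lo, hi = sorted((s, o))
    return zlib.crc32(f"{lo}\x00{hi}".encode("utf-8")) % n_partitions


def hash_join_count(pairs: Iterable[Pair], n_partitions: int = 4) -> int:
    """Count pairs (s, o) for which (o, s) is also present."""
    partitions: List[List[Pair]] = [[] for _ in range(n_partitions)]
    for s, o in pairs:
        partitions[pair_partition(s, o, n_partitions)].append((s, o))

    count = 0
    for part in partitions:
        # Build phase: a temporary hash structure per partition. This is
        # the structure that can skew the working set when it gets large.
        build = set(part)
        # Probe phase: run the same relation through the hash.
        count += sum(1 for s, o in build if (o, s) in build)
    return count


knows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c")]
joined = hash_join_count(knows)
```

<p>The result does not depend on the number of partitions, since both halves of a reciprocal pair always hash to the same place; what changes is the size of each temporary hash table, which is exactly the cardinality question raised above.</p>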

<p>These same methods apply to relational data just as well.  This does not make generic RDF storage outperform an application-specific relational representation on the same platform, as the latter benefits from all the same optimizations, but in terms of sheer numbers, this makes RDF representation an option where it was not an option before. RDF is all about not needing to design the schema around the queries, and not needing to limit what joins with what else.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-24#1459">
  <rss:title>State of the Semantic Web, Part 1 - Sociology, Business, and Messaging (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-24T10:19:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was in Vienna for the Linked Data Practitioners gathering this week. Danny Ayers asked me if I would blog about the State of the Semantic Web or write the This Week&#39;s Semantic Web column. I don&#39;t have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to Virtuoso will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft. Sören Auer asked me to say a few things about relational to RDF mapping. I will cite some highlights from this, as they pertain to the general scene. There was an &quot;open hacking&quot; session Wednesday night featuring lightning talks. I will use some of these too as a starting point. The messaging? The SWEO (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became &quot;Data Unleashed.&quot; Why not? Let&#39;s give this some context. So, if we are holding a Semantic Web 101 session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world. Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with myopenlink.net and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our Billion Triples Challenge demo. We will make a webcast about this to make it all clearer. 
Behold: The Data Web is about the world becoming a database; writing SPARQL queries or triples is incidental. You will write FOAF files by hand just as little as you now write SQL insert statements for filling in your account information on Myspace. Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the RDBMS was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more ad hoc. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced. Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity? A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain. However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic. What is the business value of this? The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. 
Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because business loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web. Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data. Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with linked open data. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question. We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. 
If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times. But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global information driven economy? The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information. But let&#39;s get back to more specific issues At the lightning talks in Vienna, one participant said, &quot;Man&#39;s enemy is not the lion that eats men, it&#39;s his own brother. Semantic Web&#39;s enemy is the XML Web services stack that ate its lunch.&quot; There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. 
To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze. SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the ad hoc component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit. &quot;I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it&#39;s too complex, the developer who says it is not what he&#39;s used to, and so on?&quot; This is an education question. One of the findings of SWEO&#39;s enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF based technologies offer to make these more open standards based. Open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: Experts can&#39;t be found even when they are in the next department, or reports and analyses get buried in wikis, spreadsheets, and emails. Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise knowledge bases from HR, CRM, office automation, wikis, etc. 
The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was in <a href="http://dbpedia.org/resource/Vienna" id="link-id0x28471870">Vienna</a> for the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x26f0ec28">Linked Data</a> Practitioners gathering this week. Danny Ayers asked me if I would <a href="http://dbpedia.org/resource/Blog" id="link-id0x26cf7678">blog</a> about the State of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x273087e0">Semantic Web</a> or write the <i>This Week&#39;s Semantic Web</i> column. I don&#39;t have the time to cover all that may have happened during the past week but I will editorialize about the questions that again were raised in Vienna. How these things relate to <a href="http://virtuoso.openlinksw.com" id="link-id0x264e11b8">Virtuoso</a> will be covered separately. This is about the overarching questions of the times, not the finer points of geek craft.</p>
<p>
<a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x2787de70">Sören Auer</a> asked me to say a few things about relational to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x280b12f8">RDF</a> mapping. I will cite some highlights from this, as they pertain to the general scene. There was an &quot;open hacking&quot; session Wednesday night featuring lightning talks. I will use some of these too as a starting point.</p>
<h3>The messaging?</h3>
<p>The <a href="http://www.w3.org/2001/sw/sweo/" id="link-id0x28078030">SWEO</a> (Semantic Web Education and Outreach) interest group of the W3C spent some time looking for an elevator pitch for the Semantic Web. It became &quot;<a href="http://dbpedia.org/resource/Data" id="link-id0x290a48c0">Data</a> Unleashed.&quot; Why not? Let&#39;s give this some context.</p>
<p>So, if we are holding a <i>Semantic Web 101</i> session, where should we begin? I hazard to guess that we should not begin by writing a FOAF file in Turtle by hand, as this is one thing that is not likely to happen in the real world.</p>
<p>Of course, the social aspect of the Data Web is the most immediately engaging, so a demo might be to go make an account with <a href="http://myopenlink.net/" id="link-id0x272ed6d0">myopenlink.net</a> and see that after one has entered the data one normally enters for any social network, one has become a Data Web citizen. This means that one can be found, just like this, with a query against the set of data spaces hosted on the system. Then we just need a few pages that repurpose this data and relate it to other data. We show some samples of queries like this in our <a href="http://challenge.semanticweb.org/" id="link-id0x25fda5c8">Billion Triples Challenge</a> demo. We will make a webcast about this to make it all clearer.</p>
<p>Behold: The Data Web is about the world becoming a database; writing <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x278c3878">SPARQL</a> queries or triples is incidental. You will write FOAF files by hand just as little as you now write <a href="http://dbpedia.org/resource/SQL" id="link-id0x27e6be18">SQL</a> insert statements for filling in your account <a href="http://dbpedia.org/resource/Information" id="link-id0x2727a278">information</a> on Myspace.</p>
<p>Every time there is a major shift in technology, this shift needs to be motivated by addressing a new class of problem. This means doing something that could not be done before. The last time this happened was when the relational database became the dominant IT technology. At that time, the questions involved putting the enterprise in the database and building a cluster of Line Of Business (LOB) applications around the database. The argument for the <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x26020128">RDBMS</a> was that you did not have to constrain the set of queries that might later be made, when designing the database. In other words, it was making things more <i>ad hoc</i>. This was opposed then on grounds of being less efficient than the hierarchical and network databases which the relational eventually replaced.</p>
<p>Today, the point of the Data Web is that you do not have to constrain what your data can join or integrate with, when you design your database. The counter-argument is that this is slow and geeky and not scalable. See the similarity?</p>
<p>A difference is that we are not specifically aiming at replacing the RDBMS. In fact, if you know exactly what you will query and have a well defined workload, a relational representation optimized for the workload will give you about 10x the performance of the equivalent RDF warehouse. OLTP remains a relational-only domain.</p>
<p>However, when we are talking about doing queries and analytics against the Web, or even against more than a handful of relational systems, the things which make RDBMS good become problematic.</p>
<h3>What is the business value of this?</h3>
<p>The most reliable of human drives is the drive to make oneself known. This drives all, from any social scene to business communications to politics. Today, when you want to proclaim you exist, you do so first on the Web. The Web did not become the prevalent medium because business loved it for its own sake; it became prevalent because businesses could not afford not to assert their presence there. If anything, the Web eroded the communications dominance of a lot of players, which was not welcome but still had to be dealt with, by embracing the Web.</p>
<p>Today, in a world driven by data, the Data Web will be catalyzed by similar factors: If your data is not there, you will not figure in query results. Search engines will play some role there but also many social applications will have reports that are driven by published data. Also consider any e-commerce, any marketplace, and so forth. The Data Portability movement is a case in point: Users want to own their own content; silo operators want to capitalize on holding it. Right now, we see these things in silos; the Data Web will create bridges between these, and what is now in silo data centers will be increasingly available on an ad hoc basis with Open Data.</p>
<p>Again, we see a movement from the specialized to the generic: What LinkedIn does in its data center can be done with ad hoc queries with <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x261c7bc8">linked open data</a>. Of course, LinkedIn does these things somewhat more efficiently because their system is built just for this task, but the linked data approach has the built-in readiness to join with everything else at almost no cost, without making a new data warehouse for each new business question.</p>
<p>We could call this the sociological aspect of the thing. Getting to more concrete business, we see an economy that, we could say, without being alarmists, is confronted with some issues. Well, generally when times are bad, this results in consolidation of property and power. Businesses fail and get split up and sold off in pieces, government adds controls and regulations and so forth. This means ad hoc data integration, as control without data is just pretense. If times are lean, this also means that there is little readiness to do wholesale replacement of systems, which will take years before producing anything. So we must play with what there is and make it deliver, in ways and conditions that were not necessarily anticipated. The agility of the Data Web, if correctly understood, can be of great benefit there, especially on the reporting and business intelligence side. Specifically mapping line-of-business systems into RDF on the fly will help with integration, making the specialized warehouse the slower and more expensive alternative. But this too is needed at times.</p>
<p>But for the RDF community to be taken seriously there, the messaging must be geared in this direction. Writing FOAF files by hand is not where you begin the pitch. Well, what is more natural than having a global, queryable information space, when you have a global information driven economy?</p>
<p>The Data Web is about making this happen. First with doing this in published generally available data; next with the enterprises having their private data for their own use but still linking toward the outside, even though private data stays private: You can still use standard terms and taxonomies, where they apply, when talking of proprietary information.</p>
<h3>But let&#39;s get back to more specific issues</h3>
<p>At the lightning talks in Vienna, one participant said, &quot;Man&#39;s enemy is not the lion that eats men, it&#39;s his own brother. Semantic Web&#39;s enemy is the <a href="http://dbpedia.org/resource/XML" id="link-id0x26273118">XML</a> Web services stack that ate its lunch.&quot; There is some truth to the first part. The second part deserves some comment. The Web services stack is about transactions. When you have a fixed, often repeating task, it is a natural thing to make this a Web service. Even though SOA is not really prevalent in enterprise IT, it has value in things like managing supply-chain logistics with partners, etc. Lots of standard messages with unambiguous meaning. To make a parallel with the database world: first there was OLTP; then there was business intelligence. Of course, you must first have the transactions, to have something to analyze.</p>
<p>SOA is for the transactions; the Data Web is for integration, analysis, and discovery. It is the <i>ad hoc</i> component of the real time enterprise, if you will. It is not a competitor against a transaction oriented SOA. In fact, RDF has no special genius for transactions. Another mistake that often gets made is stretching things beyond their natural niche. Doing transactions in RDF is this sort of over-stretching without real benefit.</p>
<p>&quot;I made an ontology and it really did solve a problem. How do I convince the enterprise people, the MBA who says it&#39;s too complex, the developer who says it is not what he&#39;s used to, and so on?&quot;</p>
<p>This is an education question. One of the findings of SWEO&#39;s enterprise survey was that there was awareness that difficult problems existed. There were and are corporate ontologies and taxonomies, diversely implemented. Some of these needs are recognized. RDF based technologies offer to make these more open standards based. Open standards have proven economical in the past. What we also hear is that major enterprises do not even know what their information and human resources assets are: Experts can&#39;t be found even when they are in the next department, or reports and analyses get buried in wikis, spreadsheets, and emails.</p>
<p>Just as when SQL took off, we need vendors to do workshops on getting started with a technology. The affair in Vienna was a step in this direction. Another type of event specially focusing on vertical problems and their Data Web solutions is a next step. For example, one could do a workshop on integrating supply chain information with Data Web technologies. Or one on making enterprise <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x260172a8">knowledge</a> bases from HR, CRM, office automation, wikis, etc. The good thing is that all these things are additions to, not replacements of, the existing mission-critical infrastructure. And better use of what you already have ought to be the theme of the day.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-09-30#1445">
  <rss:title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-30T15:39:26Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Introduction We use Virtuoso 6 Cluster Edition to demonstrate the following: Text and structured information based lookups Analytics queries Analysis of co-occurrence of features like interests and tags. Dealing with identity of multiple IRI&#39;s (owl:sameAs) The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension. The demo queries can also be run directly against the SPARQL end point. The demo is being worked on at the time of submission and may be shown online by appointment. Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008 the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and Open CYC ontologies. Also examples involving owl:sameAs are being added, likewise with similarity metrics and search hit scores. The Data The database consists of the billion triples data sets and some additions like Umbel. Also the Freebase extract is newer than the challenge original. The triple count is 1115 million. In the case of web harvested resources, the data is loaded in one graph per resource. In the case of larger data sets like Dbpedia or the US census, all triples of the provenance share a data set specific graph. All string literals are additionally indexed in a full text index. No stop words are used. Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices. The Queries The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist on one hand of well known SQL features like aggregation with grouping and existence and value subqueries and on the other of RDF specific features. 
The latter include run time RDFS and OWL inferencing support and backward chaining subclasses and transitivity. Simple Lookups sparql select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) where { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) } limit 10 ; This looks up triples with semantic web in the object and makes a search hit summary of the literal, highlighting the search terms. sparql select ?tp count(*) where { ?s ?p2 ?o2 . ?o2 a ?tp . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?tp order by desc 2 limit 40 ; This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt. What are these things called? sparql select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 rdfs:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; Many of these things do not have an rdfs:label. Let us use a more general concept of label which groups dc:title, foaf:name and other name-like properties together. The subproperties are resolved at run time; there is no materialization. sparql define input:inference &#39;b3s&#39; select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 b3s:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing. sparql select ?g count(*) where { graph ?g { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) } } group by ?g order by desc 2 ; Now some web 2.0 tagging of search results. The tag cloud of &quot;computer&quot; sparql select ?lbl count (*) where { ?s ?p ?o . ?o bif:contains &quot;computer&quot; . ?s sioc:topic ?tg . optional { ?tg rdfs:label ?lbl } } group by ?lbl order by desc 2 limit 40 ; This query will find the posters who talk the most about sex. 
sparql select ?auth count (*) where { ?d dc:creator ?auth . ?d ?p ?o filter (bif:contains (?o, &quot;sex&quot;)) } group by ?auth order by desc 2 ; Analytics We look for people who are joined by having relatively uncommon interests but do not know each other. sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 where { { select ?i count (*) as ?cnt where { ?p foaf:interest ?i } group by ?i } filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) . ?p1 foaf:interest ?i . ?p2 foaf:interest ?i . filter (?p1 != ?p2 &amp;&amp; !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) . ?p1 foaf:nick ?n1 . ?p2 foaf:nick ?n2 . } order by ?cnt limit 50 ; The query takes a fairly long time, most of it spent counting the interested people per interest over 25M interest triples. It then takes people that share the interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query can be written more efficiently but is here just to show that database-wide scans of the population are possible ad hoc. Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF. 
create table tag_count (tcn_tag iri_id_8, tcn_count int, primary key (tcn_tag)); alter index tag_count on tag_count partition (tcn_tag int (0hexffff00)); create table tag_coincidence (tc_t1 iri_id_8, tc_t2 iri_id_8, tc_count int, tc_t1_count int, tc_t2_count int, primary key (tc_t1, tc_t2)); alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00)); create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00)); How many times is each topic mentioned? insert into tag_count select * from (sparql define output:valmode &quot;LONG&quot; select ?t count (*) as ?cnt where { ?s sioc:topic ?t } group by ?t) xx option (quietcast); Take all t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 &lt; that of t2. insert into tag_coincidence (tc_t1, tc_t2, tc_count) select &quot;t1&quot;, &quot;t2&quot;, cnt from (select &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt from (sparql define output:valmode &quot;LONG&quot; select ?t1 ?t2 where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags where &quot;t1&quot; &lt; &quot;t2&quot; group by &quot;t1&quot;, &quot;t2&quot;) xx where isiri_id (&quot;t1&quot;) and isiri_id (&quot;t2&quot;) option (quietcast); Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster. update tag_coincidence set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1), tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2); Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing. 
To show the URIs of the tags: select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count from tag_coincidence order by tc_count desc; Social Networks We look at what interests people have sparql select ?o ?cnt where { { select ?o count (*) as ?cnt where { ?s foaf:interest ?o } group by ?o } filter (?cnt &gt; 100) } order by desc 2 limit 100 ; Now the same for the Harry Potter fans sparql select ?i2 count (*) where { ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; . ?p foaf:interest ?i2 } group by ?i2 order by desc 2 limit 20 ; We see whether knows relations are symmetrical. We return the top n people that others claim to know without being reciprocally known. sparql select ?celeb, count (*) where { ?claimant foaf:knows ?celeb . filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant }))) } group by ?celeb order by desc 2 limit 10 ; We look for a well connected person to start from. sparql select ?p count (*) where { ?p foaf:knows ?k } group by ?p order by desc 2 limit 50 ; We look for the most connected of the many online identities of Stefan Decker. sparql select ?sd count (distinct ?xx) where { ?sd a foaf:Person . ?sd ?name ?ns . filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . ?sd foaf:knows ?xx } group by ?sd order by desc 2 ; We count the transitive closure of Stefan Decker&#39;s connections sparql select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Now we do the same while following owl:sameAs links. sparql define input:same-as &quot;yes&quot; select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Demo System The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. 
The system demonstrated hosts these 12 servers on 2 machines, each with 2 x Xeon 5345 and 16GB memory and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. The timing difference between first and second run is easily an order of magnitude.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<h2>Introduction</h2> 

<p>We use <a href="http://virtuoso.openlinksw.com" id="link-id0xa278560">Virtuoso</a> 6 Cluster Edition to demonstrate the following:</p>
<ul>
<li>Text and structured <a href="http://dbpedia.org/resource/Information" id="link-id0xb3a4490">information</a> based lookups</li>
<li>Analytics queries</li>
<li>Analysis of co-occurrence of features like interests and tags</li>
<li>Dealing with the identity of multiple IRIs (<a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0xa904bd8">owl</a>:sameAs)</li>
</ul>

<p>The demo is based on a set of canned <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xac185d0">SPARQL</a> queries that can be invoked using the <a href="http://ode.openlinksw.com/" id="link-id0xb8efe28">OpenLink Data Explorer</a> (<a href="http://ode.openlinksw.com/" id="link-id0xb341808">ODE</a>) Firefox extension.</p>
<p>The demo queries can also be run directly against the SPARQL end point.</p>

<p>The demo is being worked on at the time of submission and may be shown online by appointment.</p>

<p>Automatic annotation of the <a href="http://dbpedia.org/resource/Data" id="link-id0xa2fcc88">data</a> based on <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xc085440">named entity extraction</a> is
being worked on at the time of this submission.  By the time of ISWC
2008 the set of sample queries will be enhanced with queries based on
extracted <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xa92b3e0">named entities</a> and their relationships in the <a href="http://umbel.org/about/" id="link-id0xa1c7c38">UMBEL</a> and
OpenCyc ontologies.
</p>

<p>Examples involving owl:sameAs, similarity metrics, and search hit scores are also being added.</p>

<h2>The Data</h2>

<p>The database consists of the billion triples data sets and some additions like UMBEL.  The Freebase extract is also newer than the one in the original challenge data.</p>
<p>The triple count is 1115 million.</p>
<p>In the case of web harvested resources, the data is loaded in one graph per resource.</p>
<p>In the case of larger data sets like <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xa949850">DBpedia</a> or the US census, all triples from the same source share a data-set-specific graph.</p>
<p>All string literals are additionally indexed in a full text index.  No stop words are used.</p>

<p>Most queries do not specify a graph.  Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS.  All indices ending in S are bitmap indices.
</p>
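
<p>As a rough sketch, the secondary indices of such a scheme could be declared as follows. This assumes the quads live in a table <code>rdf_quad</code> with columns <code>g</code>, <code>s</code>, <code>p</code>, <code>o</code> and that the SPOG order is the table&#39;s primary key. The index names are illustrative, and the cluster partitioning clauses are left out.</p>

<pre>-- Assumed names; each index ends in s, so it is declared as a bitmap index.
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s);
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s);
</pre>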

<h2>The Queries </h2>


<p>The demo uses Virtuoso SPARQL extensions in most queries.  These
extensions consist, on the one hand, of well-known <a href="http://dbpedia.org/resource/SQL" id="link-id0xc116190">SQL</a> features such as
aggregation with grouping and existence and value subqueries and, on
the other, of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xa9047f0">RDF</a>-specific features.
The latter include run-time RDFS and OWL inferencing support and backward
chaining of subclasses and transitivity.  
</p>


<h3>Simple Lookups</h3> 

<pre>sparql 
select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) 
  } 
limit 10
;
</pre>

<p>This looks up triples whose object contains the phrase &quot;semantic web&quot; and produces a search hit summary of the literal, 
highlighting the search terms.
</p>

<pre>sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?tp
order by desc 2
limit 40
;
</pre>

<p>This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.</p>
<p>What are these things called?</p>

<pre>sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>

<p>Many of these things do not have an rdfs:label.  Let us use a more general concept of label 
which groups dc:title, foaf:name and other name-like properties together.  The subproperties are 
resolved at run time; there is no materialization.
</p>

<pre>sparql 
define input:inference &#39;b3s&#39;
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>
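
<p>For reference, a rule set like this could be set up roughly as follows. This is only a sketch: the <code>b3s:</code> namespace IRI and the rules graph name are placeholders, and the list of subproperties is illustrative rather than the one used in the demo.</p>

<pre>-- Load the subproperty declarations into a graph of their own (placeholder IRIs).
DB.DBA.TTLP (&#39;
  @prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
  @prefix dc:   &lt;http://purl.org/dc/elements/1.1/&gt; .
  @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
  @prefix b3s:  &lt;http://example.org/b3s#&gt; .

  rdfs:label rdfs:subPropertyOf b3s:label .
  dc:title   rdfs:subPropertyOf b3s:label .
  foaf:name  rdfs:subPropertyOf b3s:label .
&#39;, &#39;&#39;, &#39;http://example.org/b3s-rules&#39;);

-- Register the graph as the inference context named in define input:inference.
rdfs_rule_set (&#39;b3s&#39;, &#39;http://example.org/b3s-rules&#39;);
</pre>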

<p>We can list sources by the topics they contain.  
Below we look for graphs that mention terrorist bombing.
</p>

<pre>sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) 
      }
  } 
group by ?g 
order by desc 2
;
</pre>

<p>Now some web 2.0 tagging of search results.  The <a href="http://dbpedia.org/resource/Tag" id="link-id0xa366510">tag</a> cloud of &quot;computer&quot;:</p>

<pre>sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains &quot;computer&quot; . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;
</pre>

<p>This query will find the posters who talk the most about sex.</p>

<pre>sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, &quot;sex&quot;)) 
  } 
group by ?auth
order by desc 2
;
</pre>

<h3>Analytics </h3>

<p>We look for people who are joined by having relatively uncommon interests but do not know each other.</p>

<pre>sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 &amp;&amp; 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;
</pre>

<p>The query takes a fairly long time, most of it spent counting the interested people per interest over 25M interest triples.  
It then takes pairs of people who share an interest and checks that neither claims to know the other.  
Finally, it sorts the results with the rarest interest first.  The query could be written more efficiently, but it is 
here just to show that ad hoc, database-wide scans of the population are possible.
</p>

<p>Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style
related tags line at the bottom of a search result page.  This showcases the use of SQL together 
with SPARQL.  The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is 
much more efficiently done in SQL, especially since it gets updated as the data changes.  
This is an example of materialized intermediate results based on warehoused RDF.
</p>

<pre>create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key  (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
</pre>

<p>How many times is each topic mentioned?</p>

<pre>
insert into tag_count 
  select * 
    from (sparql define output:valmode &quot;LONG&quot; 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);
</pre>

<p>Take all t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 &lt; that of t2.</p>

<pre>insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select &quot;t1&quot;, &quot;t2&quot;, cnt 
    from 
      (select  &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt 
         from 
           (sparql define output:valmode &quot;LONG&quot;
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where &quot;t1&quot; &lt; &quot;t2&quot; 
         group by &quot;t1&quot;, &quot;t2&quot;) xx
    where isiri_id (&quot;t1&quot;) and 
          isiri_id (&quot;t2&quot;) 
    option (quietcast); 
</pre>

<p>Now put the individual occurrence counts into the same table with the co-occurrence.  This 
denormalization makes the related tags lookup faster.
</p>


<pre>update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
</pre>

<p>Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.  
A single select will return a Technorati-style related tags listing.
</p>

<p>To show the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0xaf355c8">URIs</a> of the tags:
</p>

<pre>select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;
</pre>
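
<p>The related-tags lookup for a single tag could look like the following sketch. The tag IRI is a placeholder, and the overlap ratio (joint count divided by the number of subjects carrying either tag) is just one possible ranking, not necessarily the one used in the demo. Because only the t1 &lt; t2 permutation is stored, the tag has to be looked up in both positions; the tc2 index serves the second branch.</p>

<pre>select top 10 id_to_iri (other_tag), tc_count, jaccard
  from (-- the given tag is stored as t1
        select tc_t2 as other_tag, tc_count,
               cast (tc_count as double precision)
                 / (tc_t1_count + tc_t2_count - tc_count) as jaccard
          from tag_coincidence
          where tc_t1 = iri_to_id (&#39;http://example.org/tag/semweb&#39;)
        union all
        -- the given tag is stored as t2
        select tc_t1 as other_tag, tc_count,
               cast (tc_count as double precision)
                 / (tc_t1_count + tc_t2_count - tc_count) as jaccard
          from tag_coincidence
          where tc_t2 = iri_to_id (&#39;http://example.org/tag/semweb&#39;)) xx
  order by jaccard desc;
</pre>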

<h3>Social Networks </h3>

<p>We look at what interests people have.</p>

<pre>sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt &gt; 100) 
  } 
order by desc 2 
limit 100
;
</pre>

<p>Now the same for the Harry Potter fans.</p>

<pre>sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;
</pre>

<p>We see whether foaf:knows relations are symmetrical.  We return the top n people that others claim to know without being known in return.</p>

<pre>sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;
</pre>

<p>We look for a well connected person to start from.</p>

<pre>sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;
</pre>

<p>We look for the most connected of the many online identities of Stefan Decker.</p>

<pre>sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;
</pre>

<p>We count the transitive closure of Stefan Decker&#39;s connections.</p>

<pre>sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>

<p>Now we do the same while following owl:sameAs links.</p>

<pre>sparql 
define input:same-as &quot;yes&quot;
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>
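
<p>The transitive subquery can also be bounded in depth. The sketch below assumes that the <code>t_max</code> option of Virtuoso&#39;s transitive subqueries is available in the version at hand; it counts only the connections reachable within two foaf:knows steps. The depth of 2 is arbitrary.</p>

<pre>sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_max (2), t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>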

<h2>Demo System</h2> 

<p>The system runs on Virtuoso 6 Cluster Edition.  The database is partitioned into 12 partitions, 
each served by a distinct server process. The demonstrated system hosts these 12 servers on 2 
machines, each with 2 x Xeon 5345, 16GB memory and 4 SATA disks. For scaling, the processes 
and corresponding partitions can be spread over a larger number of machines.  If each process ran on its 
own server with 16GB RAM, the whole data set could be served from memory. This is desirable for 
search engine or fast analytics applications. Most of the demonstrated queries run in memory on 
the second invocation. The timing difference between the first and second run is easily an order of 
magnitude.
</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1376">
  <rss:title>The DARQ Matter of Federation</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T13:57:30Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space. For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace. To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out. DARQ is Bastian Quilitz&#39;s federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly. Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though. Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads. Since we are convinced of the cause, let&#39;s talk about the means. For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. 
So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server. Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested. These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day. When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist: If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus {?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h } does not necessarily denote a colocated join if many servers serve items of the vocabulary. However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names. 
For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these. For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually. We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see. Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage. The emergence of voiD (Vocabulary of Interlinked Data) is a step in the direction of making federation a reality. There is a separate post about this.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space.</p>
<p>For the <a href="http://dbpedia.org/resource/Data" id="link-id0x19bbd830">data</a> web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in <a href="http://dbpedia.org/resource/Cyberspace" id="link-id0x19af2488">cyberspace</a>.</p>
<p>To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.</p>
<p>
<a href="http://darq.sourceforge.net/" id="link-id0x19b7a9c8">DARQ</a> is Bastian Quilitz&#39;s federated extension of the <a href="http://jena.sourceforge.net/" id="link-id0x19ce3da0">Jena</a> <a href="http://jena.sourceforge.net/ARQ/" id="link-id0xa569a258">ARQ</a> <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1a8d2270">SPARQL</a> processor. It has existed for a while and was also presented at <a href="http://www.eswc2008.org/" id="link-id0x1aad1d00">ESWC2008</a>. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.</p>
<p>Bastian had split <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1a8ac770">DBpedia</a> among five <a href="http://virtuoso.openlinksw.com" id="link-id0x19601d30">Virtuoso</a> servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.</p>
<p>Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.</p>
<p>Since we are convinced of the cause, let&#39;s talk about the means.</p>
<p>For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x19b94818">HTTP</a>/1.1 message. So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server.</p>
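<p>One way to approximate this batching at the query level, rather than through HTTP/1.1 pipelining, is to fold the bindings into a single request. The sketch below illustrates this for the &quot;get city where x lives&quot; fragment. The <code>ex:</code> vocabulary and the friend IRIs are made up, and the <code>in</code> operator is part of SPARQL 1.1, so a given end point may not support it.</p>
<pre>sparql 
prefix ex: &lt;http://example.org/ns#&gt;
select ?friend ?city 
where 
  { 
    ?friend ex:livesIn ?city . 
    filter (?friend in (&lt;http://example.org/people/alice&gt;, 
                        &lt;http://example.org/people/bob&gt;, 
                        &lt;http://example.org/people/carol&gt;)) 
  } 
;
</pre>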
<p>Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x17822258">JDBC</a> so Bastian can try this if interested.</p>
<p>These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.</p>
<p>When federating <a href="http://dbpedia.org/resource/SQL" id="link-id0x1a651628">SQL</a>, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:</p>
<p>If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus <code>{?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h }</code> does not necessarily denote a colocated join if many servers serve items of the vocabulary.</p>
<p>However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total, recall. For search-style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there is a server that gives geek codes but not names.</p>
<p>For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.</p>
<p>For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.</p>
<p>We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see.</p>
<p>Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1dddce48">Sponger</a>. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.</p>
<p>The emergence of voiD (Vocabulary of Interlinked Data) is a step in the direction of making federation a reality. There is <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id1109a4c8">a separate post</a> about this.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-05-09#1358">
  <rss:title>DBpedia Benchmark Revisited</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-05-09T19:27:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We ran the DBpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries. Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box. We ran the queries in two variants: With graph specified in the SPARQL FROM clause, using the default indices. With no graph specified anywhere, using an alternate indexing scheme. The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times. First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient. So we added the index: create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s); Â  Virtuoso v5 with gspo, ogps, pogs Virtuoso Cluster v6 with gspo, ogps Virtuoso Cluster v6 with gspo, ogps, pogs cold 210 s 136 s 33.4 s warm 0.600 s 4.01 s 0.628 s OK, so now let us do it without a graph being specified. 
For all platforms, we drop any existing indices, and -- create table r2 (g iri_id_8, s, iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)) alter index R2 on R2 partition (s int (0hexffff00)); log_enable (2); insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; drop table rdf_quad; alter table r2 rename RDF_QUAD; create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff)); The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient. With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia. If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme. Â  Virtuoso v5 with gspo, ogps, pogs Virtuoso Cluster v6 with spog, pogs, opgs, gpos warm 0.595 s 0.617 s The cold times were about the same as above, so not reproduced. Graph or No Graph? It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default. On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. 
We have two distinct use cases which warrant different setups of the database, simple as that. The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above. Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some &quot;super-graph&quot; object, a graph of all graphs. I don&#39;t think this is directly supported and besides most apps would not need it. Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given. Conclusions The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat. Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content. This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long. 
The morale of the story is to have a lot of RAM and space-efficient data representation. The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic. Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative. Appendix: Query Text For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM &lt;http://dbpedia.org&gt; from each query. The returned row counts are indicated below each query&#39;s text. sparql SELECT ?p ?o FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/Metropolitan_Museum_of_Art&gt; ?p ?o }; -- 1337 rows sparql PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?film1 ?actor1 ?film2 ?actor2 FROM &lt;http://dbpedia.org&gt; WHERE { ?film1 p:starring &lt;http://dbpedia.org/resource/Kevin_Bacon&gt; . ?film1 p:starring ?actor1 . ?film2 p:starring ?actor1 . ?film2 p:starring ?actor2 . }; -- 23910 rows sparql PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?artist ?artwork ?museum ?director FROM &lt;http://dbpedia.org&gt; WHERE { ?artwork p:artist ?artist . ?artwork p:museum ?museum . ?museum p:director ?director }; -- 303 rows sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt; PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; SELECT ?s ?homepage FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/Berlin&gt; geo:lat ?berlinLat . &lt;http://dbpedia.org/resource/Berlin&gt; geo:long ?berlinLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s foaf:homepage ?homepage . 
FILTER ( ?lat &lt;= ?berlinLat + 0.03190235436 &amp;&amp; ?long &gt;= ?berlinLong - 0.08679199218 &amp;&amp; ?lat &gt;= ?berlinLat - 0.03190235436 &amp;&amp; ?long &lt;= ?berlinLong + 0.08679199218) }; -- 56 rows sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt; PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; PREFIX p: &lt;http://dbpedia.org/property/&gt; SELECT ?s ?a ?homepage FROM &lt;http://dbpedia.org&gt; WHERE { &lt;http://dbpedia.org/resource/New_York_City&gt; geo:lat ?nyLat . &lt;http://dbpedia.org/resource/New_York_City&gt; geo:long ?nyLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s p:architect ?a . ?a foaf:homepage ?homepage . FILTER ( ?lat &lt;= ?nyLat + 0.3190235436 &amp;&amp; ?long &gt;= ?nyLong - 0.8679199218 &amp;&amp; ?lat &gt;= ?nyLat - 0.3190235436 &amp;&amp; ?long &lt;= ?nyLong + 0.8679199218) }; -- 13 rows</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We ran the <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1b7f9688">DBpedia</a> benchmark queries again with different
configurations of <a href="http://virtuoso.openlinksw.com" id="link-id0x1cca2e00">Virtuoso</a>. I had not studied the details of the
matter previously but now did have a closer look at the
queries.</p>
<p>Comparing numbers given by different parties is a constant
problem. In the case reported here, we loaded the full DBpedia 3,
all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6,
all on the same 4 core 2GHz Xeon with 8G RAM. All databases were
striped on 6 disks. The Cluster configuration was with 4 processes
in the same box.</p>
<p>We ran the queries in two variants:</p> 
<ul>
<li>With graph
specified in the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1b77f758">SPARQL</a> <code>FROM</code> clause, using the default indices.</li>
<li>With no graph specified anywhere, using an
alternate indexing scheme.</li>
</ul>
<p>The times below are for the sequence of 5 queries; individual
query times are not reported. I did not do a line-by-line review of
the execution plans since they seem to run well enough. We could
get some extra mileage from cost model tweaks, especially for the
numeric range conditions, but we will do this when somebody comes up
with better times.</p>
<p>First, about Virtuoso v5: Because there is a query in the set that
specifies no condition on S or O and only P, this simply cannot be
done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is
more space efficient.</p>
<p>So we added the index:</p>
<blockquote>
<code>
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
</code>
</blockquote>

<table>
 <tr>
  <td>&nbsp;</td>
  <td align="center"><b>Virtuoso v5 with<br /> gspo, ogps, pogs</b>
  </td>
  <td align="center"><b>Virtuoso Cluster v6 with <br />gspo, ogps</b>
  </td>
  <td align="center"><b>Virtuoso Cluster v6 with <br />gspo, ogps, pogs</b>
  </td>
 </tr>
<tr>
  <td><b>cold</b>
  </td>
  <td align="center">210 s</td>
  <td align="center">136 s</td>
  <td align="center">33.4 s</td>
</tr>
<tr>
  <td><b>warm</b>
  </td>
  <td align="center">0.600 s</td>
  <td align="center">4.01 s</td>
  <td align="center">0.628 s</td>
</tr>
</table>
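<p>To make the table easier to compare, the timings can be expressed relative to the v5 column. The snippet below is a small sketch that only redoes the arithmetic on the numbers in the table; the dictionary keys and the function name are made up for the example.</p>

```python
# Timings in seconds, copied from the table above (graph specified).
cold = {"v5 gspo+ogps+pogs": 210.0, "v6 gspo+ogps": 136.0, "v6 gspo+ogps+pogs": 33.4}
warm = {"v5 gspo+ogps+pogs": 0.600, "v6 gspo+ogps": 4.01, "v6 gspo+ogps+pogs": 0.628}


def ratio_to_v5(timings):
    """Express each configuration's time as a multiple of the v5 time."""
    baseline = timings["v5 gspo+ogps+pogs"]
    return {name: round(seconds / baseline, 2) for name, seconds in timings.items()}


cold_ratios = ratio_to_v5(cold)
warm_ratios = ratio_to_v5(warm)
```

<p>With these numbers, Cluster v6 with the pogs index needs about 0.16 of the v5 time when cold and about 1.05 of it when warm.</p>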

<p>OK, so now let us do it without a graph being specified. For
all platforms, we drop any existing indices, and --</p>
<blockquote>
<code>
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)); <br />
alter index R2 on R2 partition (s int (0hexffff00)); <br />
 <br />
log_enable (2); <br />
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; <br />
 <br />
drop table rdf_quad; <br />
alter table r2 rename RDF_QUAD; <br />
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); <br />
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); <br />
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
</code>
</blockquote>
<p>The code is identical for v5 and v6, except that with v5 we use
<code>iri_id (32 bit)</code> for the type, not <code>iri_id_8 (64 bit)</code>. We note that
we run out of IDs with v5 around a few billion triples, so with v6
we have double the ID length and still manage to be vastly more
space efficient.</p>
<p>With the above 4 indices, we can query the <a href="http://dbpedia.org/resource/Data" id="link-id0x6339b80">data</a> pretty much in
any combination without hitting a full scan of any index. We note
that all indices that do not begin with s end with s as a bitmap.
This takes about 60% of the space of a non-bitmap index for data such
as DBpedia.</p>
<p>If you intend to do completely arbitrary RDF queries in
Virtuoso, then chances are you are best off with the above index
scheme.</p>
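<p>Why these four indices cover any combination of bound columns can be checked with a small sketch. It picks the index whose leading key parts are bound by the query, preferring the longest such prefix. This is a simplified illustration with invented function names; the actual Virtuoso optimizer is cost based and is not modeled here.</p>

```python
# Key orders of the four indices from the scheme above.
INDICES = ["spog", "pogs", "opgs", "gpos"]


def usable_prefix(index, bound):
    """Count the leading key parts of `index` that the query binds."""
    count = 0
    for column in index:
        if column not in bound:
            break
        count += 1
    return count


def pick_index(bound):
    """Return the index with the longest bound leading prefix, or None for a full scan."""
    best = max(INDICES, key=lambda index: usable_prefix(index, bound))
    return best if usable_prefix(best, bound) else None
```

<p>Since each of s, p, o, and g leads one of the indices, any query that binds at least one column gets an index lookup instead of a full scan. For example, a pattern with only the predicate given resolves to pogs.</p>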

<table>
 <tr>
  <td>&nbsp;</td>
  <td align="center"><b> Virtuoso v5 with<br /> gspo, ogps, pogs</b>
  </td>
  <td align="center"><b> Virtuoso Cluster v6 with <br /> spog, pogs, opgs, gpos </b>
  </td>
 </tr>
<tr>
  <td><b>warm</b>
  </td>
  <td align="center">0.595 s</td>
  <td align="center">0.617 s</td>
</tr>
</table>

<p>The cold times were about the same as above, so not
reproduced.</p>
<h3>Graph or No Graph?</h3>
<p>It is in the SPARQL spirit to specify a graph and for pretty
much any application, there are entirely sensible ways of keeping
the data in graphs and specifying which ones are concerned by
queries. This is why Virtuoso is set up for this by default.</p>
<p>On the other hand, for the open web scenario, dealing with an
unknown large number of graphs, enumerating graphs is not possible
and questions like which graph of which source asserts x become
relevant. We have two distinct use cases which warrant different
setups of the database, simple as that.</p>
<p>The latter use case is not really within the SPARQL spec, so
implementations may or may not support this. For example <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x11ed7028">Oracle</a> or
Vertica would not do this well since they partition data according
to graph or predicate, respectively. On the other hand, stores that
work with one quad table, which is most of the ones out there,
should do it maybe with some configuring, as shown above.</p>
<p>Frameworks like Jena are not to my <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1a49ded0">knowledge</a> geared towards
having a wildcard for graph, although I would suppose this can be
arranged by adding some &quot;super-graph&quot; object, a graph of all
graphs. I don&#39;t think this is directly supported and besides most
apps would not need it.</p>
<p>Once the indices are right, there is no difference between
specifying a graph and not specifying a graph with the queries considered. With
more complex queries, specifying a graph or set of graphs does
allow some optimizations that cannot be done with no graph specified.
For example, bitmap intersections are possible only when all
leading key parts are given.</p>
<h3>Conclusions</h3>
<p>The best warm cache time is with v5; the five queries run under
600 ms after the first go. This is noted to show that all-in-memory with
a single thread of execution is hard to beat.</p>
<p>Cluster v6 performs the same queries in 623 ms. What is gained in
parallelism is lost in latency if all operations complete in
microseconds. On the other hand, Cluster v6 leaves v5 in the dust in
any situation that has less than 100% hit rate. This is due to
actual benefit from parallelism if operations take longer than a
few microseconds, such as in the case of disk reads. Cluster v6 has
substantially better data layout on disk, as well as fewer pages to
load for the same content.</p>
<p>This makes it possible to run the queries without the pogs
index on Cluster v6 even when v5 takes prohibitively long.</p>
<p>The moral of the story is to have a lot of RAM and a space-efficient data representation.</p>
<p>The DBpedia benchmark does not specify any random access
pattern that would give a measure of sustained throughput under
load, so we are left with the extremes of cold and warm cache of
which neither is quite realistic.</p>
<p>Chris Bizer and I have talked on and off about benchmarks and
I have made suggestions that we will see incorporated into the
Berlin SPARQL benchmark, which will, I believe, be much more
informative.</p>
<h3>Appendix: Query Text</h3>
<p>For reference, the query texts specifying the graph are below. To
run without specifying the graph, just drop the <code>FROM
&lt;http://dbpedia.org&gt;</code> from each query. The returned row counts are indicated
below each query&#39;s text.</p>
<blockquote>
 <code><pre>
sparql SELECT ?p ?o FROM &lt;http://dbpedia.org&gt; WHERE {
  &lt;http://dbpedia.org/resource/Metropolitan_Museum_of_Art&gt; ?p ?o };

-- 1337 rows

sparql PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM &lt;http://dbpedia.org&gt; WHERE {
  ?film1 p:starring &lt;http://dbpedia.org/resource/Kevin_Bacon&gt; .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?artist ?artwork ?museum ?director FROM &lt;http://dbpedia.org&gt; 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
SELECT ?s ?homepage FROM &lt;http://dbpedia.org&gt;  WHERE {
   &lt;http://dbpedia.org/resource/Berlin&gt; geo:lat ?berlinLat .
   &lt;http://dbpedia.org/resource/Berlin&gt; geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        &lt;=     ?berlinLat + 0.03190235436 &amp;&amp;
     ?long       &gt;=     ?berlinLong - 0.08679199218 &amp;&amp;
     ?lat        &gt;=     ?berlinLat - 0.03190235436 &amp;&amp; 
     ?long       &lt;=     ?berlinLong + 0.08679199218) };

-- 56 rows

sparql PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?s ?a ?homepage FROM &lt;http://dbpedia.org&gt;  WHERE {
   &lt;http://dbpedia.org/resource/New_York_City&gt; geo:lat ?nyLat .
   &lt;http://dbpedia.org/resource/New_York_City&gt; geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        &lt;=     ?nyLat + 0.3190235436 &amp;&amp;
     ?long       &gt;=     ?nyLong - 0.8679199218 &amp;&amp;
     ?lat        &gt;=     ?nyLat - 0.3190235436 &amp;&amp; 
     ?long       &lt;=     ?nyLong + 0.8679199218) };

-- 13 rows
</pre>
 </code>
</blockquote>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-11-08#1269">
  <rss:title>Social Web RDF Store Benchmark</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-11-08T13:39:39Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Elaborating on my previous post, as food for thought for an RDF store benchmarking activity under the W3C, I present the following rough sketch. At the end of the below, I propose some common business questions that should be answered by a social web aggregator. The problem with these is that it is not really possible to ask interesting questions over a large database without involving some sort of counting and grouping. I feel that we simply cannot make a representative benchmark without these, quite regardless of the fact that SPARQL in its present form does not have these features. Hence I have simply stated the questions and left any implementation open. If this seems like an interesting direction, the nascent W3C benchmarking XG (experimental group) can refine the business questions, relative query frequencies, exact data set composition, etc. Social Web RDF Benchmark by Orri Erling Goals This benchmark model&#39;s use of RDF for representing and analyzing use of social software by user communities. The benchmark consists of a scalable synthetic data set, a feed of updates to the data set, and a query mix. The data set reflects the common characteristics of the social web, with realistic distribution of connections, user contributed content, commenting, tagging, and other social web activities. The data set is expressed in the FOAF and SIOC vocabularies. The query mix is divided between relatively short, dashboard or search engine style lookups, and longer running analytics queries. The system being modeled is an an aggregator of social web content; we could liken it to an RDF-based Technorati with some extra features. Users can publish their favorite queries or mesh-ups as logical views served by the system. In this manner, queries come to depend on other queries, somewhat like SQL VIEWs can reference each other. 
There is a small qualification data set that can be tested against the queries to validate that the system under test (SUT) produces the correct results. The benchmark is scaled by number of users. To facilitate comparison, some predefined scales are offered, i.e., 100K, 300K, 1M, 3M, 10M users. Each simulated user both produces and consumes content. The level of activity of users is unevenly divided. There are two work mixes â the browsing mix, which consists of a mix of lookups and contributing content, and the analytics mix, which consists of long-running queries for tracking the state of the network. For each 100 browsing mixes, one analytics mix is performed. A benchmark run is at least 1h real-time in duration. The metric is calculated by the number of browsing mixes completed during the test window. This simulates 10% of the users being online at any one time, thus for a scale of 1M users, 100K browsing mixes will be simultaneously proceeding. The test driver submits the work via HTTP. What load balancing or degree of parallel serving of the requests is used is left up to the SUT. The metric is expressed as queries per second, taking the total number of queries executed by completed browsing mixes and dividing this by the real time of the measurement window. The metric is called qpsSW, for queries per second, socialweb. The cost metric is $/qpsSW, calculated by the costing rules of the TPC. If compute-on-demand infrastructure is used, the costing will be $/qpsSW/day. The test sponsor is the party contributing the result. The contribution consists of the metric and of a full disclosure report (FDR), written following a template given in the benchmark specification. The disclosure requirements follow the TPC practices, including publishing any configuration scripts, data definition language statements, timing for warm-up and test window, times for individual queries etc. All details of the hardware and software are disclosed. 
Test Support Software The software consists of the data generator and of a test driver. The test driver calls functions supplied by the test sponsor for performing the diverse operations in the test. Source code for any modifications of the test driver is to be published as part of the FDR. Rules for SUT Any hardware/software combination â including single machines, clusters, clusters rented from computer providers like Amazon EC2 â is eligible. The SUT must produce correct answers for the validation queries against the validation data set. The implementation of the queries is not restricted. These can be any SPARQL or other queries, application server based logic, stored procedures or other, in any language, provided full source code is provided in the FDR. The data set is provided as serialized RDF. The means of storage are left up to the SUT. The basic intention is to use a triple store of some form, but the specific indexing, use of property tables, materialized views, and so forth, is left up to the test sponsor. All tuning and configuration is to be published in the FDR. Simulated Workload For each operation of each mix, the specification shall present: The logical intent of the operation, the business question, e.g., What is the hot topic among my friends? The question or update expressed in terms of the data in the data set. Sample text of a query answering the question or pseudo-code for deriving the answer. Result set layout, if applicable. The relative frequencies of the queries are given in the query mix summary. Browsing Mix The browsing mix consists of the following operations: Updates Make a blog post. Make a blog comment. Make a new social contact. For one new social contact, there are 10 posts and 20 comments. Queries What are the 10 most recent posts by somebody in my friends or their friends? This would be a typical dashboard item. What are the authoritative bloggers on topic x? This is a moderately complex ad-hoc query. 
Take posts tagged with the topic, count links to them, take the blogs containing them, show the 10 most cited blogs with the most recent posts with the tag. This would be typical of a stored query, like a parameterizable report. How do I contact person x? Calculate the chain of common acquaintances best for reaching person x. For practicality, we do not do a full walk of anything but just take the distinct persons in 2 steps of the user and in 2 steps of x and see the intersection. Who are the people like me? Find the top 10 people ranked by count of tags in common in the person&#39;s tag cloud. The tag cloud is the set of interests and the set of tags in blog posts of the person. Who react to or talk about me? Count of replies to material by the user, grouped by the commenting user and the site of the comment, top 20, sorted by count descending. Who are my fans that I do not know? Same as above, excluding people within 2 steps. Who are my competitors? Most prolific posters on topics of my interest that do not cite me. Where is the action? On forums where I participate, what are the top 5 threads, as measured by posts in the last day. Show count of posts in the last day and the day before that. How do I get there? Who are the people active around both topic x and y? This is defined by a person having participated during the last year in forums of x as well as of y. Forums are tagged by topics. The most active users are first. The ranking is proportional to the sum of the number of posts in x and y. Analytic Mix These queries are typical questions about the state of the conversation space as a whole and can for example be published as a weekly summary page. The fastest propagating idea - What is the topic with the most users who have joined in the last day? A user is considered to have joined if the user was not discussing this in the past 10 days. Prime movers - What users start conversations? A conversation is the set of material in reply to or citing a post. 
The reply distance can be arbitrarily long, the citing distance is a direct link to the original post or a reply there to. The number and extent of conversations contribute towards the score. Geography - Over the last 10 days, for each geographic area, show the top 50 tags. The location is the location of the poster. Social hubs - For each community, get the top 5 people who are central to it in terms of number of links to other members of the same community and in terms of being linked from posts. A community is the set of forums that have a specific topic.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Elaborating on <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1269" id="link-idfe9e1d8">my previous post</a>, as food for thought for an <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1d1e1468">RDF</a> store benchmarking activity under the W3C, I present the following rough sketch. At the end of the below, I propose some common business questions that should be answered by a social web aggregator.</p>
<p>The problem with these is that it is not really possible to ask interesting questions over a large database without involving some sort of counting and grouping. I feel that we simply cannot make a representative benchmark without these, quite regardless of the fact that <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xba84830">SPARQL</a> in its present form does not have these features. Hence I have simply stated the questions and left any implementation open. If this seems like an interesting direction, the nascent W3C benchmarking XG (experimental group) can refine the business questions, relative query frequencies, exact <a href="http://dbpedia.org/resource/Data" id="link-id0x1c272b10">data</a> set composition, etc.</p>
<h3>Social Web RDF Benchmark </h3>
<p>
<i>by Orri Erling</i>
</p>
<h4>Goals</h4>
<p>This benchmark models the use of RDF for representing and analyzing the use of social software by user communities. The benchmark consists of a scalable synthetic data set, a feed of updates to the data set, and a query mix. The data set reflects the common characteristics of the social web, with realistic distribution of connections, user-contributed content, commenting, tagging, and other social web activities. The data set is expressed in the FOAF and SIOC vocabularies. The query mix is divided between relatively short, dashboard or search engine style lookups, and longer-running analytics queries.</p>
<p>The system being modeled is an aggregator of social web content; we could liken it to an RDF-based Technorati with some extra features.</p>
<p>Users can publish their favorite queries or mash-ups as logical views served by the system. In this manner, queries come to depend on other queries, somewhat like <a href="http://dbpedia.org/resource/SQL" id="link-id0xb75c930">SQL</a> VIEWs can reference each other.</p>
<p>There is a small qualification data set that can be tested against the queries to validate that the system under test (SUT) produces the correct results.</p>
<p>The benchmark is scaled by number of users. To facilitate comparison, some predefined scales are offered, i.e., 100K, 300K, 1M, 3M, 10M users. Each simulated user both produces and consumes content. The level of activity of users is unevenly divided.</p>
<p>There are two work mixes: the browsing mix, which consists of a mix of lookups and contributing content, and the analytics mix, which consists of long-running queries for tracking the state of the network. For each 100 browsing mixes, one analytics mix is performed.</p>
<p>A benchmark run is at least 1h real-time in duration. The metric is calculated by the number of browsing mixes completed during the test window. This simulates 10% of the users being online at any one time, thus for a scale of 1M users, 100K browsing mixes will be simultaneously proceeding.</p>
<p>The test driver submits the work via <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1ae7c010">HTTP</a>. What load balancing or degree of parallel serving of the requests is used is left up to the SUT.</p>
<p>The metric is expressed as queries per second, taking the total number of queries executed by completed browsing mixes and dividing this by the real time of the measurement window. The metric is called qpsSW, for <i>queries per second, socialweb</i>. The cost metric is $/qpsSW, calculated by the costing rules of the TPC. If compute-on-demand infrastructure is used, the costing will be $/qpsSW/day.</p>
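<p>The scaling rule and the two metrics reduce to simple arithmetic, sketched below. The function names and the sample figures in the usage note are hypothetical and only illustrate the definitions given above; the costing itself would follow the TPC rules, which are not reproduced here.</p>

```python
def concurrent_mixes(users, online_fraction=0.10):
    """Browsing mixes in flight, with 10% of users online at any one time."""
    return round(users * online_fraction)


def qps_sw(queries_in_completed_mixes, window_seconds):
    """Queries from completed browsing mixes divided by the measurement window."""
    return queries_in_completed_mixes / window_seconds


def cost_per_qps_sw(system_cost_dollars, qps):
    """Cost metric $/qpsSW; the cost figure itself comes from the TPC costing rules."""
    return system_cost_dollars / qps
```

<p>For the 1M user scale, <code>concurrent_mixes(1_000_000)</code> gives the 100K simultaneous browsing mixes mentioned above. As a made-up example, 7,200,000 queries from completed mixes in a one hour window would give <code>qps_sw(7_200_000, 3600)</code>, that is 2000 qpsSW.</p>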
<p>The test sponsor is the party contributing the result. The contribution consists of the metric and of a full disclosure report (FDR), written following a template given in the benchmark specification. The disclosure requirements follow the TPC practices, including publishing any configuration scripts, data definition language statements, timing for warm-up and test window, times for individual queries etc. All details of the hardware and software are disclosed.</p>
<h4>Test Support Software</h4>
<p>The software consists of the data generator and of a test driver. The test driver calls functions supplied by the test sponsor for performing the diverse operations in the test. Source code for any modifications of the test driver is to be published as part of the FDR.</p>
<h4>Rules for SUT</h4>
<p>Any hardware/software combination, including single machines, clusters, and clusters rented from computing providers such as Amazon EC2, is eligible.</p>
<p>The SUT must produce correct answers for the validation queries against the validation data set.</p>
<p>The implementation of the queries is not restricted. These can be any SPARQL or other queries, <a href="http://dbpedia.org/resource/Application_server" id="link-id0x1a38aee0">application server</a> based logic, stored procedures or other, in any language, as long as full source code is included in the FDR.</p>
<p>The data set is provided as serialized RDF. The means of storage are left up to the SUT. The basic intention is to use a triple store of some form, but the specific indexing, use of property tables, materialized views, and so forth, is left up to the test sponsor. All tuning and configuration is to be published in the FDR.</p>
<h4>Simulated Workload</h4>
<p>For each operation of each mix, the specification shall present:</p>
<ol>
 <li>
  <p>The logical intent of the operation, the business question, e.g., <i>What is the hot topic among my friends?</i>
  </p>
</li>
<li>
  <p>The question or update expressed in terms of the data in the data set.</p>
</li>
<li>
  <p>Sample text of a query answering the question or pseudo-code for deriving the answer.</p>
</li>
<li>
  <p>Result set layout, if applicable.</p>
</li>
</ol>
<p>The relative frequencies of the queries are given in the query mix summary.</p>
<h4>Browsing Mix</h4>
<p>The browsing mix consists of the following operations:</p>
<h5>Updates</h5>
<ul>
<li>
  <p>Make a <a href="http://dbpedia.org/resource/Blog" id="link-id0x1e0f6470">blog</a> post.</p>
</li>
<li>
  <p>Make a blog comment.</p>
</li>
<li>
  <p>Make a new social contact.</p>
</li>
</ul>
<p>For one new social contact, there are 10 posts and 20 comments.</p>
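<p>One way a driver could respect this ratio is a weighted random choice among the three update operations. This is only a sketch: the operation names are placeholders, and whether an actual driver would draw at random or follow a fixed schedule is not specified here.</p>

```python
import random
from collections import Counter

# From the text: for one new social contact there are 10 posts and 20 comments.
# The operation names are placeholders chosen for this example.
UPDATE_WEIGHTS = {"new_contact": 1, "blog_post": 10, "blog_comment": 20}


def pick_updates(count: int, seed: int = 0) -> list:
    """Draw `count` update operations at random, weighted 1:10:20 (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    operations = list(UPDATE_WEIGHTS)
    weights = [UPDATE_WEIGHTS[name] for name in operations]
    return rng.choices(operations, weights=weights, k=count)


sample = pick_updates(31_000)
print(Counter(sample))  # counts should land near the 1:10:20 ratio
```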
<h5>Queries</h5>
<ul>
 <li>
  <p>
    <i>What are the 10 most recent posts by somebody in my friends or their friends?</i> This would be a typical dashboard item.</p>
 </li>
<li>
  <p>
    <i>What are the authoritative bloggers on topic x?</i> This is a moderately complex ad-hoc query. Take posts tagged with the topic, count links to them, take the blogs containing them, show the 10 most cited blogs with the most recent posts with the <a href="http://dbpedia.org/resource/Tag" id="link-id0xbf5ace8">tag</a>. This would be typical of a stored query, like a parameterizable report.</p>
</li>
<li>
  <p>
    <i>How do I contact person x?</i> Calculate the chain of common acquaintances best for reaching person x. For practicality, we do not do a full walk of anything but just take the distinct persons in 2 steps of the user and in 2 steps of x and see the intersection.</p>
</li>
<li>
  <p>
    <i>Who are the people like me?</i> Find the top 10 people ranked by count of tags in common in the person&#39;s tag cloud. The tag cloud is the set of interests and the set of tags in blog posts of the person.</p>
</li>
<li>
  <p>
    <i>Who reacts to or talks about me?</i> Count of replies to material by the user, grouped by the commenting user and the site of the comment, top 20, sorted by count descending.</p>
</li>
<li>
  <p>
    <i>Who are my fans that I do not know?</i> Same as above, excluding people within 2 steps.</p>
</li>
<li>
  <p>
    <i>Who are my competitors?</i> Most prolific posters on topics of my interest that do not cite me.</p>
</li>
<li>
  <p>
    <i>Where is the action?</i> On forums where I participate, what are the top 5 threads, as measured by posts in the last day? Show the count of posts in the last day and in the day before that.</p>
</li>
<li>
  <p>
    <i>How do I get there? Who are the people active around both topic x and y?</i> This is defined by a person having participated during the last year in forums of x as well as of y. Forums are tagged by topics. The most active users are first. The ranking is proportional to the sum of the number of posts in x and y.</p>
</li>
</ul>
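<p>To make the <i>How do I contact person x?</i> operation more concrete, here is a minimal in-memory sketch of the two-step intersection. It assumes the social graph is an undirected adjacency dictionary and reads "in 2 steps" as "within 2 steps"; both are assumptions made for this example. A real SUT would answer this against its own store, for instance with a SPARQL query.</p>

```python
def within_two_steps(graph: dict, person: str) -> set:
    """Distinct persons reachable from `person` in one or two steps."""
    first = set(graph.get(person, ()))
    second = set()
    for friend in first:
        second.update(graph.get(friend, ()))
    return first.union(second) - {person}


def common_acquaintances(graph: dict, user: str, target: str) -> set:
    """Persons within two steps of both `user` and `target`, excluding the two themselves."""
    near_user = within_two_steps(graph, user)
    near_target = within_two_steps(graph, target)
    return near_user.intersection(near_target) - {user, target}


# Toy graph: a chain a - b - c - d - e, stored with both directions of each edge.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
print(common_acquaintances(toy_graph, "a", "e"))
```

<p>In the toy chain, only <i>c</i> lies within two steps of both ends, so it is the one candidate for an introduction.</p>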
<h4>Analytic Mix</h4>
<p>These queries are typical questions about the state of the conversation space as a whole and can for example be published as a weekly summary page.</p>
<ul>
<li>
  <p>
    <b>The fastest propagating idea</b> - <i>What is the topic with the most users who have joined in the last day?</i> A user is considered to have joined if the user was not discussing this in the past 10 days.</p>
</li>
<li>
  <p>
    <b>Prime movers</b> - <i>What users start conversations?</i> A conversation is the set of material in reply to or citing a post. The reply distance can be arbitrarily long; the citing distance is a direct link to the original post or to a reply to it. The number and extent of conversations contribute towards the score.</p>
</li>
<li>
  <p>
    <b>Geography</b> - Over the last 10 days, for each geographic area, show the top 50 tags. The location is the location of the poster.</p>
</li>
<li>
  <p>
    <b>Social hubs</b> - For each community, get the top 5 people who are central to it in terms of number of links to other members of the same community and in terms of being linked from posts. A community is the set of forums that have a specific topic.</p>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2006-08-10#1024">
  <rss:title>Virtuoso and ODS Update</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-08-10T11:06:01Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have released an update of Virtuoso Open Source Edition and the OpenLink Data Spaces suite. This marks the coming of age of our RDF and SPARQL efforts. We have the new SQL cost model with SPARQL awareness, and we have applications which present much of their data as SIOC, FOAF, ATOM, OWL, and other formats. We continue refining these technologies. Our next roadmap item is mapping relational data into RDF and offering SPARQL access to relational data without data duplication. Expect a white paper about this soon.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have released an update of <a href="http://virtuoso.openlinksw.com" id="link-id0xddc9c48">Virtuoso</a> Open Source Edition and the <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x199d1fc0">OpenLink Data Spaces</a> suite.</p>
<p>This marks the coming of age of our <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x19347570">RDF</a> and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1b202218">SPARQL</a> efforts. We have the new <a href="http://dbpedia.org/resource/SQL" id="link-id0x18bf3c08">SQL</a> cost model with SPARQL awareness, and we have applications which present much of their <a href="http://dbpedia.org/resource/Data" id="link-id0x1a161428">data</a> as SIOC, FOAF, ATOM, OWL, and other formats.</p>
<p>We continue refining these technologies. Our next roadmap item is mapping relational data into RDF and offering SPARQL access to relational data without data duplication. Expect a white paper about this soon.</p>]]></content:encoded>
 </rss:item>
</rdf:RDF>