<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/dav/dav-blog-1/">
  <rss:title>OpenLink Community Blog</rss:title>
  <rss:link>http://www.openlinksw.com/weblog/dav/dav-blog-1/</rss:link>
  <rss:description>A Collection of blogs by OpenLink Staff</rss:description>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">kidehen@openlinksw.com</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-11-23T12:05:22Z</dc:date>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1555" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1554" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2009-03-25#1538" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2009-03-25#1537" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1505" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-16#1499" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-20#1485" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1481" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1480" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1478" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1467" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1466" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1464" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-10-24#1461" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-10-10#1456" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1450" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1435" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-09-08#1433" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-09-05#1430" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-08-28#1425" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-07-17#1393" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-07-17#1392" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1381" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1380" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1379" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1376" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1375" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1374" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-05-02#1357" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-03-27#1330" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-11-21#1274" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2007-08-28#1248" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2007-08-28#1246" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2007-02-09#1137" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2007-02-08#1134" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2007-01-18#1122" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-12-07#1095" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-07-15#1006" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-06-01#988" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#986" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#985" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#983" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#982" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-25#981" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-15#974" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/vdb/blog/?date=2006-04-24#962" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-04-24#961" />
      <rdf:li rdf:resource="http://www.openlinksw.com/weblog/oerling/?date=2006-04-17#958" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-04-11#951" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-03-19#941" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-02-09#932" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-11-16#904" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-10-28#887" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-09-16#867" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-05-20#849" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-05-01#831" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-04-29#825" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-03-17#754" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-03-08#746" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-02-14#687" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-02-12#685" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-02-11#684" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-01-27#668" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-12-20#651" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-09#559" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-09#1100" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-09#557" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-04#555" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-04-23#526" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-03-24#493" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-02-03#462" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-11-11#423" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-10-31#410" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-10-24#399" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-09-04#247" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-08-21#241" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-07-07#201" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-06-25#182" />
      <rdf:li rdf:resource="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-06-17#279" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1555">
  <rss:title>Social Web Camp (#5 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-30T16:14:02Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.) The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option. By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks; one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload. Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts twitting and re-twitting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases. There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. 
Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don&#39;t get lost in it. There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space. I participated in discussions on security and privacy and on mobile social networks and context. For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking. There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can&#39;t be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. 
A service along the lines of Garlic&#39;s Data Patrol should be a part of the social web infrastructure of the future. People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever. In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see. For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one&#39;s location at the granularity of the city; for some other purposes, one would say which conference room one is in. Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems. There is a thin line between convenience and having IT infrastructure rule one&#39;s life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. 
Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(Last of five posts related to the <a href="http://www2009.org/" id="link-id0x112efd58">WWW 2009</a> conference, held the week of April 20, 2009.)

</p>
<p>The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.</p>

<p>By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks; one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant <a href="http://dbpedia.org/resource/Information" id="link-id0x14e380b8">information</a> overload.</p>

<p>Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts twitting and re-twitting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.</p>

<p>There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don&#39;t get lost in it.</p>

<p>There is <a href="https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/index.html" id="link-id170cb010">a CIA memorandum about how analysts misinterpret data and see what they want to see</a>. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.</p>

<p>I participated in discussions on security and privacy and on mobile social networks and context.</p>

<p>For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.</p>

<p>There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies <a id="link-id14aaff90">à la</a> <a href="http://dbpedia.org/resource/SQL" id="link-id0x13d77830">SQL</a> do not work well when schema is fluid and end-users can&#39;t be expected to formulate or understand these. Remember <a href="http://dbpedia.org/resource/Ted_Nelson" id="link-id0x156ceae0">Ted Nelson</a>? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic&#39;s Data Patrol should be a part of the social web infrastructure of the future.</p>

<p>People at MIT have developed AIR (Accountability In <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x14e2abc0">RDF</a>) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.</p>

<p>In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.</p>

<p>For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one&#39;s location at the granularity of the city; for some other purposes, one would say which conference room one is in.</p>

<p>Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.</p>

<p>There is a thin line between convenience and having IT infrastructure rule one&#39;s life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-<a href="http://dbpedia.org/resource/Knowledge" id="link-id0x70d82ff8">knowledge</a>, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1554">
  <rss:title>Social Web Camp (#5 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-04-30T16:14:02Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.) The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option. By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks; one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload. Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts twitting and re-twitting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases. There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. 
Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don&#39;t get lost in it. There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space. I participated in discussions on security and privacy and on mobile social networks and context. For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking. There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can&#39;t be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. 
A service along the lines of Garlic&#39;s Data Patrol should be a part of the social web infrastructure of the future. People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever. In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see. For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one&#39;s location at the granularity of the city; for some other purposes, one would say which conference room one is in. Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems. There is a thin line between convenience and having IT infrastructure rule one&#39;s life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. 
Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(Last of five posts related to the <a href="http://www2009.org/" id="link-id0xd28c860">WWW 2009</a> conference, held the week of April 20, 2009.)

</p>
<p>The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.</p>

<p>By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks; one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant <a href="http://dbpedia.org/resource/Information" id="link-id0x10019a70">information</a> overload.</p>

<p>Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts twitting and re-twitting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.</p>

<p>There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don&#39;t get lost in it.</p>

<p>There is <a href="https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/index.html" id="link-id170cb010">a CIA memorandum about how analysts misinterpret data and see what they want to see</a>. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.</p>

<p>I participated in discussions on security and privacy and on mobile social networks and context.</p>

<p>For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.</p>

<p>There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies <a id="link-id14aaff90">à la</a> <a href="http://dbpedia.org/resource/SQL" id="link-id0x10b058d0">SQL</a> do not work well when schema is fluid and end-users can&#39;t be expected to formulate or understand these. Remember <a href="http://dbpedia.org/resource/Ted_Nelson" id="link-id0x145b3070">Ted Nelson</a>? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic&#39;s Data Patrol should be a part of the social web infrastructure of the future.</p>

<p>People at MIT have developed AIR (Accountability In <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x10dec8f8">RDF</a>) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we look at the history of secrets at all, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.</p>

<p>In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.</p>

<p>For mobility, there was some talk about the context of usage. Again, this is difficult. In some contexts, one would, for example, disclose one&#39;s location at the granularity of the city; in others, one would say which conference room one is in.</p>

<p>Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.</p>

<p>There is a thin line between convenience and having IT infrastructure rule one&#39;s life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-<a href="http://dbpedia.org/resource/Knowledge" id="link-id0xd7b1808">knowledge</a>, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2009-03-25#1538">
  <rss:title>Beyond Applications - Introducing the Planetary Datasphere (Part 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-25T15:50:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware. Application Hosting For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0. Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon. What of Proprietary Data and its Security? Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes. What of Individual Privacy on the Open Web? The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. 
In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row. I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed. Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document. It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer. 
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user. The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway. Federation vs. Centralization On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this. A search engine can host a copy of the Document Web and allow text lookups on it. 
But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription. The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data. Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs. What is the Cost of Schema-Last? The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies. We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both. While anything can be represented, the masses are not that creative. 
For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course. Historically, our industry has been driven by two phenomena: New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security. Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database. To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad. Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air. DataSphere Precursors Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences. We have a totally ubiquitous network for the distribution of this, plus database technology to make this work. We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened. 
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data. Databases and Servers Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere. This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards. The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD&#39;s get a bit cheaper, one could have SSD for the database and disk for backup. With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over. For operating such systems, scaling-on-demand is necessary. 
Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self administrating since these things are quite complex and easily intractable if one does not have in depth understanding of this rather complex field. The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. Much cheaper than SAN&#39;s and Infiniband fabrics. Conclusions and Next Steps The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs. Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been. The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. 
These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic, they are hardcore database, concerning automatic provisioning and such matters. It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software. Related Beyond Applications - Introducing the Planetary Datasphere (Part 1) Serendipitous Discovery Quotient (SDQ) How Linked Data will change Advertising The Time for RDBMS Primacy Downgrade is Nigh! Data Spaces</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id155e3bd0">We have looked at the general implications of the DataSphere</a>, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.</p>

<h2>Application Hosting</h2>

<p>For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.</p>

<p>Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.</p>

<h2>What of Proprietary <a href="http://dbpedia.org/resource/Data" id="link-id0x10fd3e18">Data</a> and its Security?</h2>

<p>Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed-circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x13cae0b0">RDBMS</a> security, the best approach is to map the RDBMS to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x13deb7d8">RDF</a> on the fly, always running all requests through the RDBMS. This implicitly preserves any policy- or label-based security schemes.</p>

<h2>What of Individual Privacy on the Open Web?</h2>

<p>The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with <code>https://</code> URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In <a href="http://dbpedia.org/resource/SQL" id="link-id0x1211a490">SQL</a>, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.</p>
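
<p>As a rough illustration of a policy that adds extra criteria behind the scenes, the sketch below wraps a caller&#39;s query in a department filter before running it. It uses Python&#39;s standard <code>sqlite3</code> module only to keep the example self-contained; the <code>payroll</code> table, its columns, and the <code>apply_policy</code> helper are assumptions made for this example, not a description of how any particular DBMS implements policies.</p>

```python
import sqlite3


def apply_policy(base_query, user_department):
    """Wrap a query so that only rows of the caller's department are visible.

    Hypothetical policy: payroll rows may be read only within one's own department.
    """
    wrapped = "SELECT * FROM (" + base_query + ") AS q WHERE q.dept_id = ?"
    return wrapped, (user_department,)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payroll (name TEXT, dept_id INTEGER, salary INTEGER)")
    conn.executemany(
        "INSERT INTO payroll VALUES (?, ?, ?)",
        [("alice", 1, 100), ("bob", 2, 200), ("carol", 1, 300)],
    )
    # The caller asks for everything; the policy narrows it to department 1.
    sql, params = apply_policy("SELECT name, dept_id, salary FROM payroll", 1)
    rows = conn.execute(sql + " ORDER BY name", params).fetchall()
    conn.close()
    return rows
```

<p>In a real DBMS this rewriting happens inside the query processor, so the application cannot bypass it. Here it is left to the caller, which is the main weakness of the sketch.</p>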

<p>I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.</p>

<p>Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data <a href="http://en.wikipedia.org/wiki/Data_Spaces" id="link-id0x13beff18">Space</a> can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are less likely to slip by, as they would when, for example, policy rules miss some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.</p>

<p>It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer.</p>
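
<p>The two example policies above could be sketched roughly as follows. Everything here is hypothetical: the graph names, the OpenID URLs, and the helper functions are made up for illustration, and a real application would read the attendee list and the social network from the data itself.</p>

```python
from typing import Callable, Dict, Iterable, Set

# A policy is a function from the reader's OpenID to allow/deny.
Policy = Callable[[str], bool]


def members_only(openids: Iterable[str]) -> Policy:
    """Allow only readers whose OpenID appears in the given list."""
    allowed = frozenset(openids)

    def check(reader: str) -> bool:
        return reader in allowed

    return check


def readable_graphs(policies: Dict[str, Policy], reader: str) -> Set[str]:
    """Return the names of the graphs this reader may access."""
    return {graph for graph, policy in policies.items() if policy(reader)}


# Hypothetical data: one graph per calendar meeting and per photo album.
ATTENDEES = ["https://alice.example/", "https://bob.example/"]
OWNER_NETWORK = ["https://bob.example/", "https://carol.example/"]

POLICIES = {
    "urn:example:calendar:meeting-42": members_only(ATTENDEES),
    "urn:example:photos:album-1": members_only(OWNER_NETWORK),
}
```

<p>With these definitions, <code>readable_graphs(POLICIES, "https://alice.example/")</code> contains only the meeting graph, while Bob, who is on both lists, sees both graphs.</p>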

<p>The difference between the Document Web and the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x13106cd0">Linked Data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x13ca1050">Web</a> is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.</p>

<p>The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed <a href="http://dbpedia.org/resource/Information" id="link-id0x13a02d60">information</a> using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.</p>
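
<p>To make the point about intermediate results concrete, here is a small sketch of a join over quads (graph, subject, predicate, object) in which the graph filter is applied at every pattern match rather than on the final answer. The data, the graph names, and the function names are assumptions for the example; this is not the DataSphere&#39;s actual policy mechanism.</p>

```python
# Hypothetical quads: (graph, subject, predicate, object).
QUADS = [
    ("g:public", "alice", "knows", "bob"),
    ("g:private", "bob", "knows", "carol"),
    ("g:public", "carol", "name", "Carol"),
]


def match(quads, allowed_graphs, predicate):
    """Yield (subject, object) pairs for a predicate, from allowed graphs only."""
    for graph, subj, pred, obj in quads:
        if graph in allowed_graphs and pred == predicate:
            yield subj, obj


def friend_of_friend_names(quads, allowed_graphs, person):
    """Names of friends of friends, using only triples from allowed graphs."""
    names = []
    for s1, friend in match(quads, allowed_graphs, "knows"):
        if s1 != person:
            continue
        for s2, fof in match(quads, allowed_graphs, "knows"):
            if s2 != friend:
                continue
            for s3, name in match(quads, allowed_graphs, "name"):
                if s3 == fof:
                    names.append(name)
    return names
```

<p>Carol&#39;s name sits in a public graph, but the only path from Alice to Carol runs through a private triple. Checking only the returned value would let the answer through and so reveal that Bob knows Carol; filtering each intermediate match yields an empty result for a reader who may see only <code>g:public</code>.</p>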

<h2>Federation vs. Centralization</h2>

<p>On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. A single database cluster will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds of billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, partitioning will arise by type of usage, user profile, etc., but this is no longer airtight and applications more or less float on top of all of this.</p>

<p>A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.</p>

<p>The cost of hosting a trillion triples will vary radically depending on what throughput is promised. This may result in pricing per service level, much as ISP pricing varies with promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five million joins per second, including infrequently accessed data.</p>
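
<p>The free allowance mentioned earlier, a first million single-row joins followed by a subscription, could be metered along these lines. This is a minimal sketch: the <code>JoinMeter</code> class and <code>QuotaExceeded</code> exception are invented for the example, and how a database actually counts joins is not covered here.</p>

```python
FREE_JOIN_QUOTA = 1_000_000  # free single-row joins, per the figure in the text


class QuotaExceeded(Exception):
    """Raised when a non-subscriber would exceed the free allowance."""


class JoinMeter:
    """Hypothetical per-user counter of single-row joins."""

    def __init__(self, subscribed=False, quota=FREE_JOIN_QUOTA):
        self.subscribed = subscribed
        self.quota = quota
        self.used = 0

    def charge(self, joins):
        """Record `joins` joins; refuse if a non-subscriber would pass the quota."""
        if not self.subscribed and self.used + joins > self.quota:
            raise QuotaExceeded("free quota of %d joins exhausted" % self.quota)
        self.used += joins
        return self.used
```

<p>A query engine would call <code>charge</code> as it executes, so that an unexpectedly expensive query stops when the allowance runs out rather than after the resources have been spent.</p>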

<p>Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs.</p>

<h2>What is the Cost of Schema-Last?</h2>

<p>The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.</p>

<p>We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent, completely generic RDF representation of the same thing. The gap may narrow, but task-specific representations will keep an edge. We ought to know, as we do both.</p>

<p>While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.</p>

<p>Historically, our industry has been driven by two phenomena:</p>

<ol>
<li>
  <b>New PCs every 2 years.</b> To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.</li>

<li>
  <b>Agility, or <i>ad hoc</i> over planned.</b> The reason the RDBMS won over <a href="http://dbpedia.org/resource/CODASYL" id="link-id0x24ee5098">CODASYL</a> network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.</li>
</ol>

<p>To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.</p>

<p>Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air.</p>

<h2>DataSphere Precursors</h2>

<p>Right now, we have the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x13ea7938">Linked Open Data</a> movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x13ae45e8">DBpedia</a>, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.</p>

<p>We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.</p>

<p>We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.</p>

<p>Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.</p>

<h2>Databases and Servers</h2>

<p>Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data-intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on, but this we have explained elsewhere.</p>

<p>This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8 GB of RAM, a dual-core AMD x86-64 CPU, and 4 disks may cost about $700. A machine with 2 x 4-core Xeons, 16 GB of RAM, and 8 disks may be $4000, counting just the components. In our experience, about 32 GB of RAM per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low-end boards.</p>
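
<p>The arithmetic behind these figures can be checked with a few lines of Python. The constants are the ones quoted above; the function names are made up for the example, and the estimate counts boards only, leaving out racks, networking, and power.</p>

```python
import math

GB_PER_BILLION_TRIPLES = 32   # minimum RAM per billion triples, from the text
BOARD_RAM_GB = 8              # low-end board, from the text
BOARD_COST_USD = 700          # approximate price of that board, from the text


def boards_for_ram(total_ram_gb, board_ram_gb=BOARD_RAM_GB):
    """Number of boards needed to reach the requested amount of RAM."""
    return math.ceil(total_ram_gb / board_ram_gb)


def cluster_estimate(total_ram_gb):
    """Board count, component cost, and triple capacity for a given RAM size."""
    boards = boards_for_ram(total_ram_gb)
    return {
        "boards": boards,
        "cost_usd": boards * BOARD_COST_USD,
        "billion_triples": total_ram_gb / GB_PER_BILLION_TRIPLES,
    }
```

<p>For 1 TB, taken as 1024 GB, <code>cluster_estimate(1024)</code> gives 128 boards at a component cost of $89,600, with room for about 32 billion triples at the 32 GB minimum.</p>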

<p>The workload is all about large joins across partitions. The queries parallelize well, so using the largest and most expensive machines as building blocks is not cost-efficient. Having absolutely everything in RAM is also not cost-efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSD for the database and disk for backup.</p>

<p>With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.</p>

<p>For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these things are complex and easily become intractable without an in-depth understanding of the field.</p>

<p>The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.</p>

<h2>Conclusions and Next Steps</h2>

<p>The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.</p>

<p>Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, <a href="http://dbpedia.org/resource/Cyc" id="link-id0x13c36da8">OpenCYC</a>, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.</p>

<p>The Linked Data Web has its origins in <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1405f0a8">Semantic Web</a> research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this <a href="http://dbpedia.org/resource/Blog" id="link-id0x2329f1a8">blog</a>, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges here, however, are not at all what is discussed at Semantic Web conferences. They have to do with the complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess, but not impossible. The details that remain to be worked out are not semantic; they are hardcore database matters, such as automatic provisioning.</p>

<p>It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: <i><b>go for it</b></i>. Stay tuned for more announcements, as we near production with our next generation of software.</p>


<h2>Related</h2>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id14e02bb0">Beyond Applications - Introducing the Planetary Datasphere (Part 1)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id117dc518">Serendipitous Discovery Quotient (SDQ)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1534" id="link-id15c52410">How Linked Data will change Advertising</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id11e93658">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace" id="link-id1491a588">Data Spaces</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2009-03-25#1537">
  <rss:title>Beyond Applications - Introducing the Planetary Datasphere (Part 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-03-25T15:50:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware. Application Hosting For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0. Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon. What of Proprietary Data and its Security? Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes. What of Individual Privacy on the Open Web? The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. 
In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row. I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed. Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document. It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer. 
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user. The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway. Federation vs. Centralization On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this. A search engine can host a copy of the Document Web and allow text lookups on it. 
But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription. The cost for hosting a trillion triples will vary radically in function of what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies in function of promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data. Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs. What is the Cost of Schema-Last? The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies. We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both. While anything can be represented, the masses are not that creative. 
For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course. Historically, our industry has been driven by two phenomena: New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security. Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database. To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad. Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air. DataSphere Precursors Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences. We have a totally ubiquitous network for the distribution of this, plus database technology to make this work. We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened. 
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data. Databases and Servers Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere. This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards. The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSD&#39;s get a bit cheaper, one could have SSD for the database and disk for backup. With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over. For operating such systems, scaling-on-demand is necessary. 
Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self administrating since these things are quite complex and easily intractable if one does not have in depth understanding of this rather complex field. The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. Much cheaper than SAN&#39;s and Infiniband fabrics. Conclusions and Next Steps The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs. Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been. The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. 
These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic, they are hardcore database, concerning automatic provisioning and such matters. It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software. Related Beyond Applications - Introducing the Planetary Datasphere (Part 1) Serendipitous Discovery Quotient (SDQ) How Linked Data will change Advertising The Time for RDBMS Primacy Downgrade is Nigh! Data Spaces</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id155e3bd0">We have looked at the general implications of the DataSphere</a>, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.</p>

<h2>Application Hosting</h2>

<p>For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.</p>

<p>Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.</p>

<h2>What of Proprietary <a href="http://dbpedia.org/resource/Data" id="link-id0x13b5b4a0">Data</a> and its Security?</h2>

<p>Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&#39;s own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x24db80e0">RDBMS</a> security, the best is to map the RDBMS to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x24ea7c40">RDF</a> on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes.</p>

<h2>What of Individual Privacy on the Open Web?</h2>

<p>The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with <code>https://</code> URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In <a href="http://dbpedia.org/resource/SQL" id="link-id0x24aaccc0">SQL</a>, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.</p>
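
<p>As a rough illustration of the SQL policy idea described above, the following sketch emulates the rewrite in application code using Python's built-in <code>sqlite3</code> module. The <code>employee</code> and <code>payroll</code> tables and the <code>payroll_for</code> function are hypothetical names invented for this example; a real DBMS would add the extra condition inside the query processor rather than in the client.</p>

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (name TEXT PRIMARY KEY, dept_id INTEGER);
CREATE TABLE payroll (name TEXT, dept_id INTEGER, salary INTEGER);
INSERT INTO employee VALUES ('ann', 1), ('bo', 2);
INSERT INTO payroll VALUES ('ann', 1, 100), ('cy', 1, 120), ('bo', 2, 90);
""")


def payroll_for(user):
    """Return payroll rows visible to `user`.

    The caller conceptually asks for all payroll rows; the policy
    condition (same department as the requesting user) is added here,
    standing in for what a DBMS policy would do behind the scenes.
    """
    sql = (
        "SELECT name, salary FROM payroll "
        "WHERE dept_id = (SELECT dept_id FROM employee WHERE name = ?) "
        "ORDER BY name"
    )
    return conn.execute(sql, (user,)).fetchall()


print(payroll_for("ann"))
print(payroll_for("bo"))
```

<p>A user with no department row sees nothing, because the comparison against a missing department ID never evaluates to true.</p>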

<p>I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.</p>

<p>Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data <a href="http://en.wikipedia.org/wiki/Data_Spaces" id="link-id0x2396c058">Space</a> can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.</p>

<p>It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&#39;s social network. Defining such policies is typically a task for the application developer.</p>

<p>The difference between the Document Web and the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x238a0098">Linked Data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x23882280">Web</a> is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.</p>

<p>The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&#39;s disclosed <a href="http://dbpedia.org/resource/Information" id="link-id0x2388e458">information</a> using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.</p>
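
<p>A minimal sketch of such graph-level filtering might look like the following. It is not the DataSphere implementation; the quad list, the graph names, and the policy table are all assumptions made up for the example. The point it tries to show is that the filter applies to every lookup in a join, so a closed graph cannot contribute even to an intermediate result.</p>

```python
from typing import Callable, Dict, List, Set, Tuple

# (graph, subject, predicate, object); hypothetical sample data.
Quad = Tuple[str, str, str, str]

QUADS: List[Quad] = [
    ("g:alice-public", "alice", "knows", "bob"),
    ("g:bob-private", "bob", "knows", "carol"),
    ("g:bob-public", "bob", "knows", "dave"),
]

# Hypothetical policies: graph name -> function deciding whether a user may read it.
# A graph without an entry here is treated as closed.
POLICIES: Dict[str, Callable[[str], bool]] = {
    "g:alice-public": lambda user: True,
    "g:bob-public": lambda user: True,
    "g:bob-private": lambda user: user in {"alice", "bob"},
}


def readable_graphs(user: str) -> Set[str]:
    """Graphs the user may touch in any part of a query."""
    return {graph for graph, allowed in POLICIES.items() if allowed(user)}


def match(user: str, subject: str, predicate: str) -> List[str]:
    """Single triple-pattern lookup restricted to readable graphs."""
    graphs = readable_graphs(user)
    return [o for g, s, p, o in QUADS
            if g in graphs and s == subject and p == predicate]


def friends_of_friends(user: str, start: str) -> Set[str]:
    """Two-hop join; both hops go through the same graph filter."""
    result: Set[str] = set()
    for friend in match(user, start, "knows"):
        result.update(match(user, friend, "knows"))
    return result


print(sorted(friends_of_friends("alice", "alice")))
print(sorted(friends_of_friends("eve", "alice")))
```

<p>Here the first query can reach both of Bob's graphs, while the second, issued by a user outside the private graph's policy, only sees what Bob has published openly.</p>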

<h2>Federation vs. Centralization</h2>

<p>On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.</p>

<p>A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.</p>

<p>The cost of hosting a trillion triples will vary radically as a function of the throughput promised. This may result in pricing per service level, much as ISP pricing varies with promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five million joins per second, including on infrequently accessed data.</p>

<p>Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&#39;s spectrum of query needs.</p>

<h2>What is the Cost of Schema-Last?</h2>

<p>The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.</p>

<p>We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.</p>

<p>While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.</p>

<p>Historically, our industry has been driven by two phenomena:</p>

<ol>
<li>
  <b>New PCs every 2 years.</b> To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.</li>

<li>
  <b>Agility, or <i>ad hoc</i> over planned.</b> The reason the RDBMS won over <a href="http://dbpedia.org/resource/CODASYL" id="link-id0x13b23460">CODASYL</a> network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.</li>
</ol>

<p>To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.</p>

<p>Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&#39;s in the air.</p>

<h2>DataSphere Precursors</h2>

<p>Right now, we have the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x236a9be8">Linked Open Data</a> movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x24db8350">DBpedia</a>, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.</p>

<p>We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.</p>

<p>We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.</p>

<p>Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.</p>

<h2>Databases and Servers</h2>

<p>Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data-intensive workloads, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on, but we have explained this elsewhere.</p>

<p>This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8 GB of RAM, a dual-core AMD x86-64 CPU, and 4 disks may cost about $700. A board with two 4-core Xeons, 16 GB, and 8 disks may be $4000, counting just the components. In our experience, about 32 GB per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low-end boards.</p>
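
<p>The arithmetic behind these figures can be sketched as follows. The prices and the 32 GB per billion triples figure are taken from the paragraph above; treating 1 TB as 1024 GB and counting only board components (no racks, networking, or power) are simplifying assumptions of this example.</p>

```python
import math

GB_PER_BILLION_TRIPLES = 32  # minimum figure quoted in the text

# (RAM in GB, component cost in USD) for the two example boards in the text.
LOW_END_BOARD = (8, 700)
XEON_BOARD = (16, 4000)


def cluster_cost(total_ram_gb, board):
    """Return (number of boards, total component cost) to reach total_ram_gb."""
    ram_gb, cost_usd = board
    boards = math.ceil(total_ram_gb / ram_gb)
    return boards, boards * cost_usd


def capacity_billion_triples(total_ram_gb):
    """Billions of triples that fit at the quoted minimum RAM per billion."""
    return total_ram_gb / GB_PER_BILLION_TRIPLES


one_tb = 1024  # assumption: 1 TB counted as 1024 GB
print(cluster_cost(one_tb, LOW_END_BOARD))
print(cluster_cost(one_tb, XEON_BOARD))
print(capacity_billion_triples(one_tb))
```

<p>Under these assumptions, 128 low-end boards provide 1 TB for about $89,600 in components, enough for roughly 32 billion triples at the quoted minimum, while the same amount of RAM from the Xeon boards would come to about $256,000.</p>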

<p>The workload is all about large joins across partitions. The queries parallelize well, so using the largest and most expensive machines as building blocks is not cost-efficient. Having absolutely everything in RAM is also not cost-efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSDs for the database and disks for backup.</p>

<p>With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.</p>

<p>For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these things are complex and easily become intractable without an in-depth understanding of the field.</p>
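
<p>To make the replication idea concrete, here is a small sketch of one possible placement rule: each partition gets a requested number of copies, hot partitions get more, and no two copies of the same partition share a server. The round-robin strategy, the function name <code>place_partitions</code>, and the server and partition names are assumptions for illustration, not a description of how any particular DBMS does it.</p>

```python
from typing import Dict, List


def place_partitions(copies: Dict[str, int], servers: List[str]) -> Dict[str, List[str]]:
    """Assign each partition's copies to distinct servers, round-robin.

    `copies` maps a partition name to the number of replicas wanted.
    Consecutive server slots are used for one partition, so its copies
    are distinct as long as it asks for no more copies than there are servers.
    """
    placement: Dict[str, List[str]] = {}
    cursor = 0
    for partition, count in copies.items():
        if count > len(servers):
            raise ValueError(f"{partition}: more copies than servers")
        placement[partition] = [
            servers[(cursor + i) % len(servers)] for i in range(count)
        ]
        cursor += count
    return placement


# Hypothetical cluster: the frequently accessed partition is kept in triplicate.
placement = place_partitions(
    {"hot": 3, "warm": 2, "cold": 2},
    ["s1", "s2", "s3", "s4"],
)
for partition, hosts in placement.items():
    print(partition, hosts)
```

<p>A production system would also have to rebalance when servers join or leave and account for rack or power domains; the sketch only shows the basic constraint of spreading copies across different machines.</p>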

<p>The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.</p>

<h2>Conclusions and Next Steps</h2>

<p>The ingredients and use cases are there. If server clusters with 1 TB of RAM begin under $100K, the cost of deployment is small compared to personnel costs.</p>

<p>Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, <a href="http://dbpedia.org/resource/Cyc" id="link-id0x2396a038">OpenCYC</a>, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.</p>

<p>The Linked Data Web has its origins in <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x13ea7110">Semantic Web</a> research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this <a href="http://dbpedia.org/resource/Blog" id="link-id0x2388e368">blog</a>, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are, however, not at all what is discussed in Semantic Web conferences. They have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database engineering, concerning automatic provisioning and such matters.</p>

<p>It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: <i><b>go for it</b></i>. Stay tuned for more announcements, as we near production with our next generation of software.</p>


<h2>Related</h2>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1535" id="link-id14e02bb0">Beyond Applications - Introducing the Planetary Datasphere (Part 1)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1442" id="link-id117dc518">Serendipitous Discovery Quotient (SDQ)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1534" id="link-id15c52410">How Linked Data will change Advertising</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id11e93658">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace" id="link-id1491a588">Data Spaces</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1505">
  <rss:title>Virtuoso RDF:  A Getting Started Guide for the Developer</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-17T12:31:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated. The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces. You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files. To get started, cd into binsrc/tests/lubm. To verify that this works, you can do ./test_server.sh virtuoso-t This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step. Loading the Data The file lubm-load.sql contains the commands for loading the LUBM single university qualification database. The data files themselves are in lubm_8000, 15 files in RDFXML. There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark. So now let&#39;s go through this procedure. Start the server: $ virtuoso-t -f &amp; This starts the server in foreground mode, and puts it in the background of the shell. Now we connect to it with the isql utility. $ isql 1111 dba dba This gives a SQL&gt; prompt. The default username and password are both dba. When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword. The lubm-load.sql file is quite self-explanatory. 
It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory. Next it calls this function for the lubm_8000 directory under the server&#39;s working directory. sparql CLEAR GRAPH &lt;lubm&gt;; sparql CLEAR GRAPH &lt;inf&gt;; load_lubm ( server_root() || &#39;/lubm_8000/&#39; ); Then it verifies that the right number of triples is found in the &lt;lubm&gt; graph. sparql SELECT COUNT(*) FROM &lt;lubm&gt; WHERE { ?x ?y ?z } ; The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now. Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database. sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; INSERT INTO GRAPH &lt;lubm&gt; { ?x ub:subOrganizationOf ?z } FROM &lt;lubm&gt; WHERE { ?x ub:subOrganizationOf ?y . ?y ub:subOrganizationOf ?z . }; Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph. DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, &#39;inf&#39; ) ; sparql SELECT COUNT(*) FROM &lt;inf&gt; WHERE { ?x ?y ?z } ; Then we declare that the triples in the &lt;inf&gt; graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the &#39;inft&#39; rule set. Otherwise this has no effect. rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;); This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time. checkpoint; Now we are ready for querying. Querying the Data The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. 
The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the &lt;inf&gt; graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the &lt;lubm&gt; graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file. If you wish to run all the commands in a SQL file, you can type load &lt;filename&gt;; (e.g., load lubm-cp.sql;) at the SQL&gt; prompt. If you wish to try individual statements, you can paste them to the command line. For example: SQL&gt; sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; SELECT * FROM &lt;lubm&gt; WHERE { ?x a ub:Publication . ?x ub:publicationAuthor &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; }; VARCHAR _______________________________________________________________________ http://www.Department0.University0.edu/AssistantProfessor0/Publication0 http://www.Department0.University0.edu/AssistantProfessor0/Publication1 http://www.Department0.University0.edu/AssistantProfessor0/Publication2 http://www.Department0.University0.edu/AssistantProfessor0/Publication3 http://www.Department0.University0.edu/AssistantProfessor0/Publication4 http://www.Department0.University0.edu/AssistantProfessor0/Publication5 6 Rows. -- 4 msec. To stop the server, simply type shutdown; at the SQL&gt; prompt. If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like ( [HTTPServer] ServerPort = 8421 ServerRoot = . ServerThreads = 2 ) to the end of the virtuoso.ini file in the lubm directory. Then shut down and restart (type shutdown; at the SQL&gt; prompt and then virtuoso-t -f &amp; at the shell prompt). Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL. 
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface. To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF. To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>It is a long standing promise of mine to dispel the false impression that using <a href="http://virtuoso.openlinksw.com/" id="link-id113506d0">Virtuoso</a> to work with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id115d9528">RDF</a> is complicated.</p>

<p>The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.</p>

<p>You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under Emacs (<code>M-x shell</code>) is the most convenient, since it makes it easy to cut and paste between the shell and files, but the open source <code>isql</code> utility should also have command line editing.</p>

<p>To get started, cd into <code>binsrc/tests/lubm</code>.</p>

<p>To verify that the installation works, run:</p>

<blockquote>
<pre>./test_server.sh virtuoso-t</pre></blockquote>

<p>This will test the server with the LUBM queries.  This should report 45 tests passed.  After this we will do the tests step-by-step.</p>

<h2>Loading the <a href="http://dbpedia.org/resource/Data" id="link-id10f7bd90">Data</a>
</h2> 

<p>The file <code>lubm-load.sql</code> contains the commands for loading the LUBM single university qualification database.</p>

<p>The data files themselves are in <code>lubm_8000</code>, 15 files in RDF/XML.</p>

<p>There is also a little ontology called <code>inf.nt</code>.  This declares the subclass and subproperty relations used in the benchmark.</p>

<p>So now let&#39;s go through this procedure.</p>

<p>Start the server:</p>

<blockquote>
<pre>$ virtuoso-t -f &amp;
</pre></blockquote>

<p>The <code>-f</code> flag keeps the server in the foreground instead of detaching it as a daemon, and the trailing <code>&amp;</code> runs it as a background job of the shell.</p>

<p>Now we connect to it with the isql utility.</p>

<blockquote>
<pre>$ isql 1111 dba dba 
</pre></blockquote>

<p>This gives a <code>SQL&gt;</code> prompt.  The default username and password are both <code>dba</code>.</p>

<p>When a command is <a href="http://dbpedia.org/resource/SQL" id="link-id1176ce70">SQL</a>, it is entered directly.  If it is <a href="http://dbpedia.org/resource/SPARQL" id="link-id156df468">SPARQL</a>, it is prefixed with the keyword <code>sparql</code>.  This is how all the SQL clients work.  Any SQL client, such as any <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id152d0a00">ODBC</a> or <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id157ad6a0">JDBC</a> application, can use SPARQL if the SQL string starts with this keyword.</p>
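
<p>As a minimal sketch of the difference, the two statements below can be pasted at the <code>SQL&gt;</code> prompt. The first is plain SQL against <code>DB.DBA.RDF_QUAD</code>, the table in which Virtuoso stores its quads; the second is SPARQL, recognized by the leading <code>sparql</code> keyword. Both count the stored triples across all graphs. The variable names are arbitrary, and the statements are illustrations rather than part of the LUBM scripts.</p>

<blockquote>
<pre>-- Plain SQL: count the rows of the quad table.
SELECT COUNT(*) FROM DB.DBA.RDF_QUAD;

-- SPARQL: the same kind of count, expressed as a triple pattern.
sparql 
   SELECT COUNT(*) 
    WHERE { ?s ?p ?o } ;
</pre></blockquote>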

<p>The <code>lubm-load.sql</code> file is quite self-explanatory. It begins by defining an SQL procedure, <code>load_lubm</code>, that calls the RDF/XML load function, <code>DB..RDF_LOAD_RDFXML</code>, for each file in a directory.</p>

<p>Next it clears the two graphs and calls this procedure for the <code>lubm_8000</code> directory under the server&#39;s working directory.</p>

<blockquote>
<pre>sparql 
   CLEAR GRAPH &lt;lubm&gt;;

sparql 
   CLEAR GRAPH &lt;inf&gt;;

load_lubm ( server_root() || &#39;/lubm_8000/&#39; );
</pre></blockquote>

<p>Then it verifies that the right number of triples is found in the <code>&lt;lubm&gt;</code> graph.</p>

<blockquote>
<pre>sparql 
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.</p>

<p>Then it adds some implied <code>subOrganizationOf</code> triples.  This is part of setting up the LUBM test database.</p>

<blockquote>
<pre>sparql 
   PREFIX  ub:  &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   INSERT 
      INTO GRAPH &lt;lubm&gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &lt;lubm&gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
</pre></blockquote>

<p>Then it loads the ontology file, <code>inf.nt</code>, using the Turtle load function, <code>DB.DBA.TTLP</code> (N-Triples is a subset of Turtle, so the same loader applies).  The arguments of the function are the text to load, the base IRI against which relative IRIs are resolved, and the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id15835550">URI</a> of the target graph.</p>

<blockquote>
<pre>DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), 
              &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, 
              &#39;inf&#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &lt;inf&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>Then we declare that the triples in the <code>&lt;inf&gt;</code> graph can be used for inference at run time.  The declaration by itself does not change any query results: inference is applied only when a SPARQL query explicitly declares that it uses the <code>&#39;inft&#39;</code> rule set.</p>

<blockquote>
<pre>rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;);
</pre></blockquote>

<p>Finally, the script makes a log checkpoint to finalize the work and truncate the transaction log.  The server would also eventually do this in its own time.</p>

<blockquote>
<pre>checkpoint;
</pre></blockquote>

<p>Now we are ready for querying.</p>

<h2>Querying the Data</h2> 

<p>The queries are given in three different versions: The first file, <code>lubm.sql</code>, has the queries with most inference open-coded as <code>UNION</code>s. The second file, <code>lubm-inf.sql</code>, has the inference performed at run time using the ontology <a href="http://dbpedia.org/resource/Information" id="link-id1109faf0">information</a> in the <code>&lt;inf&gt;</code> graph we just loaded.  The last, <code>lubm-phys.sql</code>, relies on having the entailed triples physically present in the <code>&lt;lubm&gt;</code> graph.  These entailed triples are inserted by the SPARUL commands in the <code>lubm-cp.sql</code> file.</p>
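
<p>To make the difference between the first two styles concrete, here is a hedged sketch; the queries are written for illustration and are not copied from the benchmark files. The first statement open-codes the subclass inference as a <code>UNION</code> over three subclasses of <code>ub:Professor</code> in the univ-bench ontology (only three are listed, for brevity). The second asks for <code>ub:Professor</code> directly and relies on the <code>define input:inference</code> pragma to apply the <code>&#39;inft&#39;</code> rule set declared during loading. Because the <code>UNION</code> does not list every subclass, the two counts need not be identical.</p>

<blockquote>
<pre>-- Inference open-coded by hand as a UNION of subclasses.
sparql 
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { { ?x  a  ub:FullProfessor } 
            UNION 
            { ?x  a  ub:AssociateProfessor } 
            UNION 
            { ?x  a  ub:AssistantProfessor } 
          };

-- Inference performed at run time with the &#39;inft&#39; rule set.
sparql 
   define input:inference &#39;inft&#39;
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x  a  ub:Professor } ;
</pre></blockquote>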

<p>If you wish to run all the commands in a SQL file, you can type <code>load &lt;filename&gt;;</code> (e.g., <code>load lubm-cp.sql;</code>) at the <code>SQL&gt;</code> prompt. If you wish to try individual statements, you can paste them at the command line.</p>

<p>For example: </p>

<blockquote>
<pre>SQL&gt; sparql 
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT * 
     FROM &lt;lubm&gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
</pre></blockquote>


<p>To stop the server, simply type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt.</p>

<p>If you wish to use a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id11384668">SPARQL protocol</a> end point, just enable the HTTP listener.  This is done by adding a stanza like this:</p>

<blockquote>
<pre>[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
</pre></blockquote>

<p>â to the end of the <code>virtuoso.ini</code> file in the <code>lubm</code> directory.  Then shutdown and restart (type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt and then <code>virtuoso-t -f &amp;</code> at the shell prompt).</p>

<p>Now you can connect to the end point with a web browser.  The <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id113d02d8">URL</a> is <code>http://localhost:8421/sparql</code>. Without parameters, this will show a human-readable form.  With parameters, the end point executes the SPARQL query they carry.</p>
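
<p>For instance, the SPARQL protocol passes the query text in the <code>query</code> parameter of the request URL. The sketch below assumes the listener configured above on port 8421 and the <code>&lt;lubm&gt;</code> graph loaded earlier. It carries the count query from the load script, <code>SELECT COUNT(*) FROM &lt;lubm&gt; WHERE { ?x ?y ?z }</code>, URL-encoded with spaces written as <code>+</code>. Whether the relative graph name resolves the same way over HTTP as it does in <code>isql</code> is an assumption to verify on your installation.</p>

<blockquote>
<pre>http://localhost:8421/sparql?query=SELECT+COUNT(*)+FROM+%3Clubm%3E+WHERE+%7B+%3Fx+%3Fy+%3Fz+%7D
</pre></blockquote>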

<p>We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, <a href="http://dbpedia.org/resource/PHP" id="link-id142d0ba0">PHP</a>, using the PHP ODBC interface.</p>

<p>To see how to use <a href="http://jena.sourceforge.net/" id="link-id117074f0">Jena</a> or <a href="http://sourceforge.net/projects/sesame/" id="link-id1103c9b0">Sesame</a> with Virtuoso, look at <a href="http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html" id="link-id15488ce8">Native RDF Storage Providers</a>. To see how RDF data types are supported, see <a href="http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf" id="link-id15784a40">Extension datatype for RDF</a>.
</p>

<p>To work with large volumes of data, you must increase the memory settings in the configuration file and use the row-autocommit mode, i.e., do <code>log_enable (2);</code> before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See the <a href="http://docs.openlinksw.com/virtuoso/" id="link-id111410f0">documentation</a> for more.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504">
  <rss:title>Virtuoso RDF:  A Getting Started Guide for the Developer</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-17T12:31:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It is a long standing promise of mine to dispel the false impression that using Virtuoso to work with RDF is complicated. The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces. You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under emacs (m-x shell) is the best. But the open source isql utility should have command line editing also. The emacs shell is however convenient for cutting and pasting things between shell and files. To get started, cd into binsrc/tests/lubm. To verify that this works, you can do ./test_server.sh virtuoso-t This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step. Loading the Data The file lubm-load.sql contains the commands for loading the LUBM single university qualification database. The data files themselves are in lubm_8000, 15 files in RDFXML. There is also a little ontology called inf.nt. This declares the subclass and subproperty relations used in the benchmark. So now let&#39;s go through this procedure. Start the server: $ virtuoso-t -f &amp; This starts the server in foreground mode, and puts it in the background of the shell. Now we connect to it with the isql utility. $ isql 1111 dba dba This gives a SQL&gt; prompt. The default username and password are both dba. When a command is SQL, it is entered directly. If it is SPARQL, it is prefixed with the keyword sparql. This is how all the SQL clients work. Any SQL client, such as any ODBC or JDBC application, can use SPARQL if the SQL string starts with this keyword. The lubm-load.sql file is quite self-explanatory. 
It begins with defining an SQL procedure that calls the RDF/XML load function, DB..RDF_LOAD_RDFXML, for each file in a directory. Next it calls this function for the lubm_8000 directory under the server&#39;s working directory. sparql CLEAR GRAPH &lt;lubm&gt;; sparql CLEAR GRAPH &lt;inf&gt;; load_lubm ( server_root() || &#39;/lubm_8000/&#39; ); Then it verifies that the right number of triples is found in the &lt;lubm&gt; graph. sparql SELECT COUNT(*) FROM &lt;lubm&gt; WHERE { ?x ?y ?z } ; The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now. Then it adds some implied subOrganizationOf triples. This is part of setting up the LUBM test database. sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; INSERT INTO GRAPH &lt;lubm&gt; { ?x ub:subOrganizationOf ?z } FROM &lt;lubm&gt; WHERE { ?x ub:subOrganizationOf ?y . ?y ub:subOrganizationOf ?z . }; Then it loads the ontology file, inf.nt, using the Turtle load function, DB.DBA.TTLP. The arguments of the function are the text to load, the default namespace prefix, and the URI of the target graph. DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, &#39;inf&#39; ) ; sparql SELECT COUNT(*) FROM &lt;inf&gt; WHERE { ?x ?y ?z } ; Then we declare that the triples in the &lt;inf&gt; graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the &#39;inft&#39; rule set. Otherwise this has no effect. rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;); This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time. checkpoint; Now we are ready for querying. Querying the Data The queries are given in 3 different versions: The first file, lubm.sql, has the queries with most inference open coded as UNIONs. 
The second file, lubm-inf.sql, has the inference performed at run time using the ontology information in the &lt;inf&gt; graph we just loaded. The last, lubm-phys.sql, relies on having the entailed triples physically present in the &lt;lubm&gt; graph. These entailed triples are inserted by the SPARUL commands in the lubm-cp.sql file. If you wish to run all the commands in a SQL file, you can type load &lt;filename&gt;; (e.g., load lubm-cp.sql;) at the SQL&gt; prompt. If you wish to try individual statements, you can paste them to the command line. For example: SQL&gt; sparql PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt; SELECT * FROM &lt;lubm&gt; WHERE { ?x a ub:Publication . ?x ub:publicationAuthor &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; }; VARCHAR _______________________________________________________________________ http://www.Department0.University0.edu/AssistantProfessor0/Publication0 http://www.Department0.University0.edu/AssistantProfessor0/Publication1 http://www.Department0.University0.edu/AssistantProfessor0/Publication2 http://www.Department0.University0.edu/AssistantProfessor0/Publication3 http://www.Department0.University0.edu/AssistantProfessor0/Publication4 http://www.Department0.University0.edu/AssistantProfessor0/Publication5 6 Rows. -- 4 msec. To stop the server, simply type shutdown; at the SQL&gt; prompt. If you wish to use a SPARQL protocol end point, just enable the HTTP listener. This is done by adding a stanza like ( [HTTPServer] ServerPort = 8421 ServerRoot = . ServerThreads = 2 ) to the end of the virtuoso.ini file in the lubm directory. Then shut down and restart (type shutdown; at the SQL&gt; prompt and then virtuoso-t -f &amp; at the shell prompt). Now you can connect to the end point with a web browser. The URL is http://localhost:8421/sparql. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL. 
We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, PHP, using the PHP ODBC interface. To see how to use Jena or Sesame with Virtuoso, look at Native RDF Storage Providers. To see how RDF data types are supported, see Extension datatype for RDF. To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do log_enable (2); before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See documentation for more.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>It is a long standing promise of mine to dispel the false impression that using <a href="http://virtuoso.openlinksw.com/" id="link-id113506d0">Virtuoso</a> to work with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id115d9528">RDF</a> is complicated.</p>

<p>The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.</p>

<p>You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under Emacs (<code>M-x shell</code>) is the most convenient, since it makes it easy to cut and paste between the shell and files, but the open source <code>isql</code> utility should also have command line editing.</p>

<p>To get started, cd into <code>binsrc/tests/lubm</code>.</p>

<p>To verify that the installation works, run:</p>

<blockquote>
<pre>./test_server.sh virtuoso-t</pre></blockquote>

<p>This will test the server with the LUBM queries.  This should report 45 tests passed.  After this we will do the tests step-by-step.</p>

<h2>Loading the <a href="http://dbpedia.org/resource/Data" id="link-id10f7bd90">Data</a>
</h2> 

<p>The file <code>lubm-load.sql</code> contains the commands for loading the LUBM single university qualification database.</p>

<p>The data files themselves are in <code>lubm_8000</code>, 15 files in RDF/XML.</p>

<p>There is also a little ontology called <code>inf.nt</code>.  This declares the subclass and subproperty relations used in the benchmark.</p>

<p>So now let&#39;s go through this procedure.</p>

<p>Start the server:</p>

<blockquote>
<pre>$ virtuoso-t -f &amp;
</pre></blockquote>

<p>The <code>-f</code> flag keeps the server in the foreground instead of detaching it as a daemon, and the trailing <code>&amp;</code> runs it as a background job of the shell.</p>

<p>Now we connect to it with the isql utility.</p>

<blockquote>
<pre>$ isql 1111 dba dba 
</pre></blockquote>

<p>This gives a <code>SQL&gt;</code> prompt.  The default username and password are both <code>dba</code>.</p>

<p>When a command is <a href="http://dbpedia.org/resource/SQL" id="link-id1176ce70">SQL</a>, it is entered directly.  If it is <a href="http://dbpedia.org/resource/SPARQL" id="link-id156df468">SPARQL</a>, it is prefixed with the keyword <code>sparql</code>.  This is how all the SQL clients work.  Any SQL client, such as any <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id152d0a00">ODBC</a> or <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id157ad6a0">JDBC</a> application, can use SPARQL if the SQL string starts with this keyword.</p>

<p>The <code>lubm-load.sql</code> file is quite self-explanatory. It begins by defining an SQL procedure, <code>load_lubm</code>, that calls the RDF/XML load function, <code>DB..RDF_LOAD_RDFXML</code>, for each file in a directory.</p>

<p>Next it clears the two graphs and calls this procedure for the <code>lubm_8000</code> directory under the server&#39;s working directory.</p>

<blockquote>
<pre>sparql 
   CLEAR GRAPH &lt;lubm&gt;;

sparql 
   CLEAR GRAPH &lt;inf&gt;;

load_lubm ( server_root() || &#39;/lubm_8000/&#39; );
</pre></blockquote>

<p>Then it verifies that the right number of triples is found in the <code>&lt;lubm&gt;</code> graph.</p>

<blockquote>
<pre>sparql 
   SELECT COUNT(*) 
     FROM &lt;lubm&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.</p>

<p>Then it adds some implied <code>subOrganizationOf</code> triples.  This is part of setting up the LUBM test database.</p>

<blockquote>
<pre>sparql 
   PREFIX  ub:  &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   INSERT 
      INTO GRAPH &lt;lubm&gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &lt;lubm&gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
</pre></blockquote>

<p>Then it loads the ontology file, <code>inf.nt</code>, using the Turtle load function, <code>DB.DBA.TTLP</code> (N-Triples is a subset of Turtle, so the same loader applies).  The arguments of the function are the text to load, the base IRI against which relative IRIs are resolved, and the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id15835550">URI</a> of the target graph.</p>

<blockquote>
<pre>DB.DBA.TTLP ( file_to_string ( &#39;inf.nt&#39; ), 
              &#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&#39;, 
              &#39;inf&#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &lt;inf&gt; 
    WHERE { ?x ?y ?z } ;
</pre></blockquote>

<p>Then we declare that the triples in the <code>&lt;inf&gt;</code> graph can be used for inference at run time.  The declaration by itself does not change any query results: inference is applied only when a SPARQL query explicitly declares that it uses the <code>&#39;inft&#39;</code> rule set.</p>

<blockquote>
<pre>rdfs_rule_set (&#39;inft&#39;, &#39;inf&#39;);
</pre></blockquote>

<p>Finally, the script makes a log checkpoint to finalize the work and truncate the transaction log.  The server would also eventually do this in its own time.</p>

<blockquote>
<pre>checkpoint;
</pre></blockquote>

<p>Now we are ready for querying.</p>

<h2>Querying the Data</h2> 

<p>The queries are given in three different versions: The first file, <code>lubm.sql</code>, has the queries with most inference open-coded as <code>UNION</code>s. The second file, <code>lubm-inf.sql</code>, has the inference performed at run time using the ontology <a href="http://dbpedia.org/resource/Information" id="link-id1109faf0">information</a> in the <code>&lt;inf&gt;</code> graph we just loaded.  The last, <code>lubm-phys.sql</code>, relies on having the entailed triples physically present in the <code>&lt;lubm&gt;</code> graph.  These entailed triples are inserted by the SPARUL commands in the <code>lubm-cp.sql</code> file.</p>

<p>If you wish to run all the commands in a SQL file, you can type <code>load &lt;filename&gt;;</code> (e.g., <code>load lubm-cp.sql;</code>) at the <code>SQL&gt;</code> prompt. If you wish to try individual statements, you can paste them to the command line.</p>

<p>For example: </p>

<blockquote>
<pre>SQL&gt; sparql 
   PREFIX ub: &lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&gt;
   SELECT * 
     FROM &lt;lubm&gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &lt;http://www.Department0.University0.edu/AssistantProfessor0&gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
</pre></blockquote>


<p>To stop the server, simply type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt.</p>

<p>If you wish to use a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id11384668">SPARQL protocol</a> end point, just enable the HTTP listener.  This is done by adding a stanza like this:</p>

<blockquote>
<pre>[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
</pre></blockquote>

<p>Add this stanza to the end of the <code>virtuoso.ini</code> file in the <code>lubm</code> directory.  Then shut down and restart (type <code>shutdown;</code> at the <code>SQL&gt;</code> prompt and then <code>virtuoso-t -f &amp;</code> at the shell prompt).</p>

<p>Now you can connect to the end point with a web browser.  The <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id113d02d8">URL</a> is <code>http://localhost:8421/sparql</code>. Without parameters, this shows a human-readable query form; with parameters, it executes the SPARQL query passed in.</p>
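
<p>For instance, assuming the server runs locally on the port configured above, a query can be passed in the <code>query</code> parameter defined by the SPARQL protocol.  The query text must be URL-encoded; the one below is <code>SELECT COUNT(*) FROM &lt;lubm&gt; WHERE { ?x ?y ?z }</code>, chosen only as an illustration:</p>

<blockquote>
<pre>http://localhost:8421/sparql?query=SELECT%20COUNT(*)%20FROM%20%3Clubm%3E%20WHERE%20%7B%20%3Fx%20%3Fy%20%3Fz%20%7D
</pre></blockquote>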

<p>We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, <a href="http://dbpedia.org/resource/PHP" id="link-id142d0ba0">PHP</a>, using the PHP ODBC interface.</p>

<p>To see how to use <a href="http://jena.sourceforge.net/" id="link-id117074f0">Jena</a> or <a href="http://sourceforge.net/projects/sesame/" id="link-id1103c9b0">Sesame</a> with Virtuoso, look at <a href="http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html" id="link-id15488ce8">Native RDF Storage Providers</a>. To see how RDF data types are supported, see <a href="http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf" id="link-id15784a40">Extension datatype for RDF</a>.
</p>

<p>To work with large volumes of data, you must increase the memory settings in the configuration file and use the row-autocommit mode, i.e., do <code>log_enable (2);</code> before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See the <a href="http://docs.openlinksw.com/virtuoso/" id="link-id111410f0">documentation</a> for more.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-12-16#1499">
  <rss:title>&quot;E Pluribus Unum&quot;, or &quot;Inversely Functional Identity&quot;, or &quot;Smooshing Without the Stickiness&quot; (re-updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-16T14:14:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable. This is complex, so I will begin with the point and the interested may read on for the details and implications. Starting with soon to be released version 6, Virtuoso allows you to say that two things, if they share a uniquely identifying property, are the same. Examples of uniquely identifying properties would be a book&#39;s ISBN number, or a person&#39;s social security plus full name. In relational language this is a unique key, and in RDF parlance, an inverse functional property. In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut. In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set. On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems. To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. 
This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness. Our ontology is defined like this: -- Populate a named graph with the triples you want to use in query time inferencing ttlp ( &#39; @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; . @prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; . foaf:mbox_sha1sum a owl:InverseFunctionalProperty . foaf:name a owl:InverseFunctionalProperty . &#39;, &#39;xx&#39;, &#39;b3sifp&#39; ); -- Declare that the graph contains an ontology for use in query time inferencing rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;, &#39;b3sifp&#39; ); Then use it: sparql DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; SELECT DISTINCT ?k ?f1 ?f2 WHERE { ?k foaf:name ?n . ?n bif:contains &quot;&#39;Kjetil Kjernsmo&#39;&quot; . ?k foaf:knows ?f1 . 
?f1 foaf:knows ?f2 }; VARCHAR VARCHAR VARCHAR ______________________________________ _______________________________________________ ______________________________ http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin .... Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same. Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both. If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT. 
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances. The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others. The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty. Enough talk. Time for an experiment. 
We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them: Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person. Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think. For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant. The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse. We count the triples that enter into the smoosh: -- the name is an existence because else we&#39;d get several times more due to -- the names occurring in many graphs sparql SELECT COUNT(*) WHERE { { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } . FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) . ?person ?p ?o }; -- We get 3284674 We make a few tables for intermediate results. 
-- For each distinct name, gather the properties and objects from -- all subjects with this name CREATE TABLE name_prop ( np_name ANY, np_p IRI_ID_8, np_o ANY, PRIMARY KEY ( np_name, np_p, np_o ) ); ALTER INDEX name_prop ON name_prop PARTITION ( np_name VARCHAR (-1, 0hexffff) ); -- Map from name to canonical IRI used for the name CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY, ni_s IRI_ID_8 ); ALTER INDEX name_iri ON name_iri PARTITION ( ni_name VARCHAR (-1, 0hexffff) ); -- Map from person IRI to canonical person IRI CREATE TABLE pref_iri ( i IRI_ID_8, pref IRI_ID_8, PRIMARY KEY ( i ) ); ALTER INDEX pref_iri ON pref_iri PARTITION ( i INT (0hexffff00) ); -- a table for the materialization where all aliases get all properties of every other CREATE TABLE smoosh_ct ( s IRI_ID_8, p IRI_ID_8, o ANY, PRIMARY KEY ( s, p, o ) ); ALTER INDEX smoosh_ct ON smoosh_ct PARTITION ( s INT (0hexffff00) ); -- disable transaction log and enable row auto-commit. This is necessary, otherwise -- bulk operations are done transactionally and they will run out of rollback space. LOG_ENABLE (2); -- Gather all the properties of all persons with a name under that name. -- INSERT SOFT means that duplicates are ignored INSERT SOFT name_prop SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; FROM ( sparql DEFINE output:valmode &quot;LONG&quot; SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . 
?x ?p ?o } ) xx ; -- Now choose for each name the canonical IRI INSERT INTO name_iri SELECT np_name, ( SELECT MIN (s) FROM rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ) AS mini FROM name_prop WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- For each person IRI, map to the canonical IRI of that person INSERT SOFT pref_iri (i, pref) SELECT s, ni_s FROM name_iri, rdf_quad WHERE o = ni_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- Make a graph where all persons have one iri with all the properties of all aliases -- and where person-to-person refs are canonicalized INSERT SOFT rdf_quad (g,s,p,o) SELECT IRI_TO_ID (&#39;psmoosh&#39;), ni_s, np_p, COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o ) FROM name_prop, name_iri WHERE ni_name = np_name OPTION ( loop, quietcast ) ; -- A little explanation: The properties of names are copied into rdf_quad with the name -- replaced with its canonical IRI. If the object has a canonical IRI, this is used as -- the object, else the object is unmodified. This is the COALESCE with the sub-query. -- This takes a little time. To check on the progress, take another connection to the -- server and do STATUS (&#39;cluster&#39;); -- It will return something like -- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i -- buffers 549481 253929 d 8 w 0 pfs -- Now finalize the state; this makes it permanent. Else the work will be lost on server -- failure, since there was no transaction log CL_EXEC (&#39;checkpoint&#39;); -- See what we got sparql SELECT COUNT (*) FROM &lt;psmoosh&gt; WHERE {?s ?p ?o}; -- This is 2253102 -- Now make the copy where all have the properties of all synonyms. This takes so much -- space we do not insert it as RDF quads, but make a special table for it so that we can -- run some statistics. This saves time. 
INSERT SOFT smoosh_ct (s, p, o) SELECT s, np_p, np_o FROM name_prop, rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- as above, INSERT SOFT so as to ignore duplicates SELECT COUNT (*) FROM smoosh_ct; -- This is 167360324 -- Find out where the bloat comes from SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC; The results are: 54728777 http://www.w3.org/2002/07/owl#sameAs 48543153 http://xmlns.com/foaf/0.1/knows 13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso 12268512 http://xmlns.com/foaf/0.1/interest 11415867 http://xmlns.com/foaf/0.1/nick 6683963 http://xmlns.com/foaf/0.1/weblog 6650093 http://xmlns.com/foaf/0.1/depiction 4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum 4129629 http://xmlns.com/foaf/0.1/homepage 1776555 http://xmlns.com/foaf/0.1/holdsAccount 1219525 http://xmlns.com/foaf/0.1/based_near 305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 274965 http://xmlns.com/foaf/0.1/name 155131 http://xmlns.com/foaf/0.1/dateOfBirth 153001 http://xmlns.com/foaf/0.1/img 111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR 52930 http://xmlns.com/foaf/0.1/gender 48517 http://www.w3.org/2004/02/skos/core#subject 45697 http://www.w3.org/2000/01/rdf-schema#label 44860 http://purl.org/vocab/bio/0.1/olb Now compare with the predicate distribution of the smoosh with identities canonicalized sparql SELECT COUNT (*) ?p FROM &lt;psmoosh&gt; WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY 1 DESC LIMIT 20; Results are: 748311 http://xmlns.com/foaf/0.1/knows 548391 http://xmlns.com/foaf/0.1/interest 140531 http://www.w3.org/2000/01/rdf-schema#seeAlso 105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 78497 http://xmlns.com/foaf/0.1/name 48099 http://www.w3.org/2004/02/skos/core#subject 45179 http://xmlns.com/foaf/0.1/depiction 40229 http://www.w3.org/2000/01/rdf-schema#comment 38272 http://www.w3.org/2000/01/rdf-schema#label 37378 http://xmlns.com/foaf/0.1/nick 37186 http://dbpedia.org/property/abstract 
34003 http://xmlns.com/foaf/0.1/img 26182 http://xmlns.com/foaf/0.1/homepage 23795 http://www.w3.org/2002/07/owl#sameAs 17651 http://xmlns.com/foaf/0.1/mbox_sha1sum 17430 http://xmlns.com/foaf/0.1/dateOfBirth 15586 http://xmlns.com/foaf/0.1/page 12869 http://dbpedia.org/property/reference 12497 http://xmlns.com/foaf/0.1/weblog 12329 http://blogs.yandex.ru/schema/foaf/school We can drop the owl:sameAs triples from the count, so the bloat is somewhat less, but it is still tens of times larger than the canonicalized copy or the initial state. Now, when we try using the psmoosh graph, we still get different results from the results with the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. These exist: sparql SELECT COUNT (*) WHERE { ?s foaf:knows ?thing . FILTER ( !bif:exists ( SELECT (1) WHERE { ?thing foaf:name ?nn } ) ) }; -- 1393940 So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here. Finally, we calculate the smoosh blow up factors. We do not include owl:sameAs triples in the counts. select (167360324 - 54728777) / 3284674.0; = 34.290022997716059 select 2229307 / 3284674.0; = 0.678699621332284 So, to get a smoosh that is not really the equivalent of the original, either multiply the original triple count by 0.68 or 34, depending on whether synonyms are collapsed or not. Making the smooshes does not take very long, some minutes for the small one. Inserting the big one would be longer, a couple of hours maybe. It was 33 minutes for filling the smoosh_ct table. The metrics were not with optimal tuning so the performance numbers just serve to show that smooshing takes time. Probably more time than allowable in an interactive situation, no matter how the process is optimized.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>What a terrible word, smooshing...  I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other.  This smooshes them together, makes them interchangeable.</p>

<p>This is complex, so I will begin with the point and the interested may read on for the details and implications.  Starting with the soon-to-be-released version 6, <a href="http://virtuoso.openlinksw.com" id="link-id15718cb8">Virtuoso</a> allows you to say that two things, if they share a uniquely identifying property, are the same.  Examples of uniquely identifying properties would be a book&#39;s ISBN, or a person&#39;s social security number plus full name.  In relational language this is a <i>unique key</i>, and in <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id145ed998">RDF</a> parlance, an <i>inverse functional property</i>.</p>

<p>In most systems, such problems are dealt with as a preprocessing step before querying.  For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules.  This is good if the rules are clear and understood.  This is so in closed situations, where things tend to have standard identifiers to begin with.  But on the open web this is not so clear cut.</p>

<p>In this post, we show how to do these things <i>ad hoc</i>, without materializing anything.  At the end, we also show how to materialize identity and what the consequences of this are with open web <a href="http://dbpedia.org/resource/Data" id="link-id11726358">data</a>.  We use real live web crawls from the <a href="http://challenge.semanticweb.org/" id="link-id14f40448">Billion Triples Challenge</a> data set.</p>

<p>On the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id156e2b10">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id1106ce08">web</a>, there are independently arising descriptions of the same thing, hence the need to smoosh if these are to be somehow integrated.  But this is only the beginning of the problems.</p>

<p>To address these, we have added the option of specifying that some property will be considered inversely functional in a query.  This is done at run time and the property does not really have to be inversely functional in the pure sense.  <code>foaf:name</code> will do for an example.  This simply means that for purposes of the query concerned, two subjects which have at least one <code>foaf:name</code> in common are considered the same. In this way, we can join between FOAF files.  With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness.</p>

<p>Our ontology is defined like this:</p>

<blockquote>
<pre>-- Populate a named graph with the triples you want to use in query time inferencing<br />
ttlp ( &#39;
        @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
        @prefix owl:  &lt;http://www.w3.org/2002/07/owl#&gt; .
        foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty  .
        foaf:name          a  owl:InverseFunctionalProperty  .
       &#39;,
       &#39;xx&#39;,
       &#39;b3sifp&#39;
     );<br />
-- Declare that the graph contains an ontology for use in query time inferencing <br />
rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;,
                &#39;b3sifp&#39;
              );
</pre></blockquote>

<p>Then use it:</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot;  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };<br />
VARCHAR                                  VARCHAR                                           VARCHAR
______________________________________   _______________________________________________   ______________________________<br />
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
</pre></blockquote>

<p>Without the inference, we get no matches.  This is because the data in question has one graph per FOAF file, and blank nodes for persons.  No graph references any person outside the ones in the graph.  So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there.  The declaration in the context named <code>b3sifp</code> just means that all things with a matching <code>foaf:name</code> or <code>foaf:mbox_sha1sum</code> are the same.</p>

<p>Sameness means that two are the same for purposes of <code>DISTINCT</code> or <code>GROUP BY</code>, and if two are the same, then both have the <code>UNION</code> of all of the properties of both.</p>

<p>If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for <code>DISTINCT</code>.</p>
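
<p>One way to observe the difference, as a sketch that reuses the rule set and search term from the query above (the shape of this query is only an illustration), is to ask for the distinct persons alone:</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; 
   SELECT DISTINCT ?k 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot; 
         };
</pre></blockquote>

<p>Under the semantics just described, aliases that share a <code>foaf:name</code> or <code>foaf:mbox_sha1sum</code> should collapse into one row per individual, whereas with naively smooshed data each alias would still come back as a separate row.</p>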

<p>If we have complex application rules for determining whether individuals are the same, then one can materialize <code>owl:sameAs</code> triples and keep them in a separate graph.  In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.</p>

<p>The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up.  Best and worst cases will always depend on the data.  In rough terms, the more <i>ad hoc</i> the use, the less desirable the materialization.  If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best.  We can do that too, but so can others.</p>

<p>The principal point is about agility as concerns the inference.  Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble.  When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal.  Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion <a href="http://dbpedia.org/resource/Blog" id="link-id1156a550">blog</a> posts or something of the sort.  So queries will typically be selective.  Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. </p>

<p>The exception is corpus-wide statistics queries.  But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound.  And when things are disk bound, size matters.  Reading extra entailment on the way is just a performance penalty.</p>

<p>Enough talk. Time for an experiment.  We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:</p>

<ol>
<li>Resolve identity at insert time.  We remove duplicate person URIs, and give the single <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id11317008">URI</a> all the properties of all the duplicate URIs.  We expect these to be most often repeats.  If a person references another person, we normalize this reference to go to the single URI of the referenced person.</li>

<li>Give every duplicate URI of a person all the properties of all the duplicates.  If these are the same value, the data should not get much bigger, or so we think.</li>
</ol>

<p>For the experiment, we will consider two people the same if they have the same <code>foaf:name</code> and are both instances of <code>foaf:Person</code>.  This yields some false matches, but these should not be statistically significant.</p>

<p>The following is a commented <a href="http://dbpedia.org/resource/SQL" id="link-id110945b0">SQL</a> script performing the smoosh.  We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone.  We use SPARQL where possible for readability.  As the documentation states, <code>iri_to_id</code> converts from the qualified name of an IRI to its ID and <code>id_to_iri</code> does the reverse.</p>

<p>We count the triples that enter into the smoosh:</p>

<blockquote>
<pre>-- the name is tested with an existence check because otherwise we&#39;d count 
-- several times more, due to the names occurring in many graphs <br />
sparql 
   SELECT COUNT(*) 
    WHERE { { SELECT DISTINCT ?person 
               WHERE { ?person a foaf:Person }
            } . 
            FILTER ( bif:exists ( SELECT (1) 
                                   WHERE { ?person foaf:name ?nn } 
                                )
                       ) . 
            ?person ?p ?o
          };<br />
-- We get 3284674
</pre></blockquote>

<p>We make a few tables for intermediate results.</p>

<blockquote>
<pre>-- For each distinct name, gather the properties and objects from 
-- all subjects with this name <br />
CREATE TABLE name_prop 
   ( np_name  ANY, 
     np_p     IRI_ID_8, 
     np_o     ANY, 
     PRIMARY KEY ( np_name, 
                   np_p, 
                   np_o
                 )
   );
ALTER INDEX name_prop 
   ON name_prop 
   PARTITION ( np_name VARCHAR (-1, 0hexffff) );<br />
-- Map from name to canonical IRI used for the name <br />
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8
                      );
ALTER INDEX name_iri 
   ON name_iri 
   PARTITION ( ni_name VARCHAR (-1, 0hexffff) );<br />
-- Map from person IRI to canonical person IRI<br />
CREATE TABLE pref_iri 
   ( i     IRI_ID_8, 
     pref  IRI_ID_8, 
     PRIMARY KEY ( i )
   );
ALTER INDEX pref_iri 
   ON pref_iri 
   PARTITION ( i INT (0hexffff00) );<br />
-- a table for the materialization where all aliases get all properties of every other <br />
CREATE TABLE smoosh_ct 
   ( s  IRI_ID_8, 
     p  IRI_ID_8, 
     o  ANY, 
     PRIMARY KEY ( s, 
                   p, 
                   o
                 ) 
   );
ALTER INDEX smoosh_ct 
   ON smoosh_ct 
   PARTITION ( s INT (0hexffff00) );<br />
-- disable transaction log and enable row auto-commit.  This is necessary, otherwise 
-- bulk operations are done transactionally and they will run out of rollback space.<br />
LOG_ENABLE (2);<br />
-- Gather all the properties of all persons with a name under that name.  
-- INSERT SOFT means that duplicates are ignored <br />
INSERT SOFT name_prop 
   SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; 
   FROM ( sparql 
          DEFINE output:valmode &quot;LONG&quot; 
          SELECT ?n ?p ?o 
          WHERE { ?x a foaf:Person . 
                 ?x foaf:name ?n . 
                 ?x ?p ?o
               }
        ) xx ;<br />
-- Now choose for each name the canonical IRI <br />
INSERT INTO name_iri 
   SELECT np_name, 
          ( SELECT MIN (s) 
              FROM rdf_quad 
             WHERE o = np_name 
                   AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;)
          ) AS mini 
     FROM name_prop 
    WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- For each person IRI, map to the canonical IRI of that person <br />
INSERT SOFT pref_iri (i, pref) 
   SELECT s, 
          ni_s 
     FROM name_iri, 
          rdf_quad 
    WHERE o = ni_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- Make a graph where all persons have one iri with all the properties of all aliases 
-- and where person-to-person refs are canonicalized<br />
INSERT SOFT rdf_quad (g,s,p,o) 
   SELECT IRI_TO_ID (&#39;psmoosh&#39;), 
          ni_s, 
          np_p, 
          COALESCE ( ( SELECT pref 
                         FROM pref_iri 
                        WHERE i = np_o
                     ), 
                     np_o 
                   )
     FROM name_prop, 
          name_iri 
    WHERE ni_name = np_name 
   OPTION ( loop, quietcast ) ;<br />
-- A little explanation:  The properties of names are copied into rdf_quad with the name 
-- replaced with its canonical IRI.  If the object has a canonical IRI, this is used as 
-- the object, else the object is unmodified.  This is the COALESCE with the sub-query.<br />
-- This takes a little time.  To check on the progress, take another connection to the 
-- server and do <br />
STATUS (&#39;cluster&#39;);<br />
-- It will return something like 
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s  75% cpu 186%  read 12% clw threads 5r 0w 0i 
-- buffers 549481 253929 d 8 w 0 pfs<br />
-- Now finalize the state; this makes it permanent.  Else the work will be lost on server 
-- failure, since there was no transaction log <br />
CL_EXEC (&#39;checkpoint&#39;);<br />
-- See what we got<br />
sparql 
   SELECT COUNT (*) 
     FROM &lt;psmoosh&gt; 
     WHERE {?s ?p ?o};<br />
-- This is 2253102<br />
-- Now make the copy where all have the properties of all synonyms.  This takes so much 
-- space we do not insert it as RDF quads, but make a special table for it so that we can 
-- run some statistics.  This saves time.<br />
INSERT SOFT smoosh_ct (s, p, o)  
   SELECT s, np_p, np_o 
     FROM name_prop, 
          rdf_quad 
    WHERE o = np_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- as above, INSERT SOFT so as to ignore duplicates <br />
SELECT COUNT (*) 
   FROM smoosh_ct;<br />
-- This is  167360324<br />
-- Find out where the bloat comes from <br />
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (p) 
   FROM smoosh_ct 
   GROUP BY p 
   ORDER BY 1 DESC;
</pre></blockquote>
<p>The results are:</p>

<blockquote>
<pre>54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb
</pre></blockquote>

<p>Now compare with the predicate distribution of the smoosh with identities canonicalized:</p>

<blockquote>
<pre>sparql 
     SELECT COUNT (*) ?p 
       FROM &lt;psmoosh&gt; 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;</pre></blockquote>

<p>Results are:</p>
<blockquote>
<pre>748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school
</pre></blockquote>

<p>We can drop the <code>owl:sameAs</code> triples from the count, which reduces the bloat somewhat, but the full copy is still tens of times larger than either the canonicalized copy or the initial state.</p>

<p>Now, when we try using the psmoosh graph, we still get results that differ from those on the original data.  This is because <code>foaf:knows</code> relations to things with no <code>foaf:name</code> are not represented in the smoosh.  They exist:</p>

<blockquote>
<pre>sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };<br />
-- 1393940
</pre></blockquote>

<p>So the smoosh graph is not an accurate rendition of the social network.  It would have to be smooshed further to be that, since the data in the sample is quite irregular.  But we do not go that far here.</p>
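
<p>As a rough illustration of the check above, the following Python sketch counts <code>foaf:knows</code> edges whose target has no <code>foaf:name</code> in a tiny in-memory set of triples.  The identifiers and data are made up for the example; this is a toy model of what the SPARQL query counts, not how Virtuoso evaluates it.</p>

```python
# Toy triples (subject, predicate, object); all identifiers are hypothetical.
triples = [
    ("p1", "foaf:name", "Alice"),
    ("p1", "foaf:knows", "p2"),
    ("p1", "foaf:knows", "x9"),  # x9 never gets a foaf:name
    ("p2", "foaf:name", "Bob"),
]

# Subjects that have at least one foaf:name.
named = {s for s, p, _ in triples if p == "foaf:name"}

# foaf:knows edges pointing at something without a name: a name-based
# smoosh has no way to resolve the identity of these targets.
dangling = [(s, o) for s, p, o in triples
            if p == "foaf:knows" and o not in named]

print(dangling)  # [('p1', 'x9')]
```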

<p>Finally, we calculate the smoosh blow-up factors.  We do not include <code>owl:sameAs</code> triples in the counts.</p>

<blockquote>
<pre>select (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059<br />
-- 2229307 = 2253102 - 23795, the psmoosh count less its owl:sameAs triples
select 2229307 / 3284674.0;
-- 0.678699621332284
</pre></blockquote>

<p>So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 34 if every synonym gets all the properties of the others, or by 0.68 if synonyms are collapsed into one canonical IRI.</p>
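
<p>The same arithmetic can be re-checked outside the database.  The short Python sketch below uses only the counts quoted in this post; the variable names are chosen for the example.</p>

```python
# Counts quoted in the post.
original_triples = 3284674    # triples entering the smoosh
full_copy_rows = 167360324    # rows in smoosh_ct
full_copy_same_as = 54728777  # owl:sameAs rows in smoosh_ct
canonical_triples = 2253102   # triples in the psmoosh graph
canonical_same_as = 23795     # owl:sameAs triples in psmoosh

# Blow-up factors with owl:sameAs left out of both smooshes.
full_factor = (full_copy_rows - full_copy_same_as) / original_triples
canonical_factor = (canonical_triples - canonical_same_as) / original_triples

print(round(full_factor, 2))       # 34.29
print(round(canonical_factor, 2))  # 0.68
```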

<p>Making the smooshes does not take very long: some minutes for the small one.  Inserting the big one as RDF quads would take longer, maybe a couple of hours; filling the <code>smoosh_ct</code> table took 33 minutes.  The system was not optimally tuned, so the performance numbers serve only to show that smooshing takes time, probably more time than is allowable in an interactive situation, no matter how the process is optimized.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498">
  <rss:title>&quot;E Pluribus Unum&quot;, or &quot;Inversely Functional Identity&quot;, or &quot;Smooshing Without the Stickiness&quot; (re-updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-12-16T14:14:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">What a terrible word, smooshing... I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other. This smooshes them together, makes them interchangeable. This is complex, so I will begin with the point and the interested may read on for the details and implications. Starting with soon to be released version 6, Virtuoso allows you to say that two things, if they share a uniquely identifying property, are the same. Examples of uniquely identifying properties would be a book&#39;s ISBN number, or a person&#39;s social security plus full name. In relational language this is a unique key, and in RDF parlance, an inverse functional property. In most systems, such problems are dealt with as a preprocessing step before querying. For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules. This is good if the rules are clear and understood. This is so in closed situations, where things tend to have standard identifiers to begin with. But on the open web this is not so clear cut. In this post, we show how to do these things ad hoc, without materializing anything. At the end, we also show how to materialize identity and what the consequences of this are with open web data. We use real live web crawls from the Billion Triples Challenge data set. On the linked data web, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated. But this is only the beginning of the problems. To address these, we have added the option of specifying that some property will be considered inversely functional in a query. This is done at run time and the property does not really have to be inversely functional in the pure sense. foaf:name will do for an example. 
This simply means that for purposes of the query concerned, two subjects which have at least one foaf:name in common are considered the same. In this way, we can join between FOAF files. With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness. Our ontology is defined like this: -- Populate a named graph with the triples you want to use in query time inferencing ttlp ( &#39; @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; . @prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt; . foaf:mbox_sha1sum a owl:InverseFunctionalProperty . foaf:name a owl:InverseFunctionalProperty . &#39;, &#39;xx&#39;, &#39;b3sifp&#39; ); -- Declare that the graph contains an ontology for use in query time inferencing rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;, &#39;b3sifp&#39; ); Then use it: sparql DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; SELECT DISTINCT ?k ?f1 ?f2 WHERE { ?k foaf:name ?n . ?n bif:contains &quot;&#39;Kjetil Kjernsmo&#39;&quot; . ?k foaf:knows ?f1 . 
?f1 foaf:knows ?f2 }; VARCHAR VARCHAR VARCHAR ______________________________________ _______________________________________________ ______________________________ http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/dajobe http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/net_twitter http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/amyvdh http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/pom http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/mattb http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/davorg http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/distobj http://www.kjetil.kjernsmo.net/foaf#me http://norman.walsh.name/knows/who/robin-berjon http://twitter.com/perigrin .... Without the inference, we get no matches. This is because the data in question has one graph per FOAF file, and blank nodes for persons. No graph references any person outside the ones in the graph. So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there. The declaration in the context named b3sifp just means that all things with a matching foaf:name or foaf:mbox_sha1sum are the same. Sameness means that two are the same for purposes of DISTINCT or GROUP BY, and if two are the same, then both have the UNION of all of the properties of both. If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for DISTINCT. 
If we have complex application rules for determining whether individuals are the same, then one can materialize owl:sameAs triples and keep them in a separate graph. In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances. The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up. Best and worst cases will always depend on the data. In rough terms, the more ad hoc the use, the less desirable the materialization. If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best. We can do that too, but so can others. The principal point is about agility as concerns the inference. Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble. When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal. Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion blog posts or something of the sort. So queries will typically be selective. Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. The exception is corpus-wide statistics queries. But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound. And when things are disk bound, size matters. Reading extra entailment on the way is just a performance penalty. Enough talk. Time for an experiment. 
We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them: Resolve identity at insert time. We remove duplicate person URIs, and give the single URI all the properties of all the duplicate URIs. We expect these to be most often repeats. If a person references another person, we normalize this reference to go to the single URI of the referenced person. Give every duplicate URI of a person all the properties of all the duplicates. If these are the same value, the data should not get much bigger, or so we think. For the experiment, we will consider two people the same if they have the same foaf:name and are both instances of foaf:Person. This gets some extra hits but should not be statistically significant. The following is a commented SQL script performing the smoosh. We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone. We use SPARQL where possible for readability. As the documentation states, iri_to_id converts from the qualified name of an IRI to its ID and id_to_iri does the reverse. We count the triples that enter into the smoosh: -- the name is an existence because else we&#39;d get several times more due to -- the names occurring in many graphs sparql SELECT COUNT(*) WHERE { { SELECT DISTINCT ?person WHERE { ?person a foaf:Person } } . FILTER ( bif:exists ( SELECT (1) WHERE { ?person foaf:name ?nn } ) ) . ?person ?p ?o }; -- We get 3284674 We make a few tables for intermediate results. 
-- For each distinct name, gather the properties and objects from -- all subjects with this name CREATE TABLE name_prop ( np_name ANY, np_p IRI_ID_8, np_o ANY, PRIMARY KEY ( np_name, np_p, np_o ) ); ALTER INDEX name_prop ON name_prop PARTITION ( np_name VARCHAR (-1, 0hexffff) ); -- Map from name to canonical IRI used for the name CREATE TABLE name_iri ( ni_name ANY PRIMARY KEY, ni_s IRI_ID_8 ); ALTER INDEX name_iri ON name_iri PARTITION ( ni_name VARCHAR (-1, 0hexffff) ); -- Map from person IRI to canonical person IRI CREATE TABLE pref_iri ( i IRI_ID_8, pref IRI_ID_8, PRIMARY KEY ( i ) ); ALTER INDEX pref_iri ON pref_iri PARTITION ( i INT (0hexffff00) ); -- a table for the materialization where all aliases get all properties of every other CREATE TABLE smoosh_ct ( s IRI_ID_8, p IRI_ID_8, o ANY, PRIMARY KEY ( s, p, o ) ); ALTER INDEX smoosh_ct ON smoosh_ct PARTITION ( s INT (0hexffff00) ); -- disable transaction log and enable row auto-commit. This is necessary, otherwise -- bulk operations are done transactionally and they will run out of rollback space. LOG_ENABLE (2); -- Gather all the properties of all persons with a name under that name. -- INSERT SOFT means that duplicates are ignored INSERT SOFT name_prop SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; FROM ( sparql DEFINE output:valmode &quot;LONG&quot; SELECT ?n ?p ?o WHERE { ?x a foaf:Person . ?x foaf:name ?n . 
?x ?p ?o } ) xx ; -- Now choose for each name the canonical IRI INSERT INTO name_iri SELECT np_name, ( SELECT MIN (s) FROM rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ) AS mini FROM name_prop WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- For each person IRI, map to the canonical IRI of that person INSERT SOFT pref_iri (i, pref) SELECT s, ni_s FROM name_iri, rdf_quad WHERE o = ni_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- Make a graph where all persons have one iri with all the properties of all aliases -- and where person-to-person refs are canonicalized INSERT SOFT rdf_quad (g,s,p,o) SELECT IRI_TO_ID (&#39;psmoosh&#39;), ni_s, np_p, COALESCE ( ( SELECT pref FROM pref_iri WHERE i = np_o ), np_o ) FROM name_prop, name_iri WHERE ni_name = np_name OPTION ( loop, quietcast ) ; -- A little explanation: The properties of names are copied into rdf_quad with the name -- replaced with its canonical IRI. If the object has a canonical IRI, this is used as -- the object, else the object is unmodified. This is the COALESCE with the sub-query. -- This takes a little time. To check on the progress, take another connection to the -- server and do STATUS (&#39;cluster&#39;); -- It will return something like -- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s 75% cpu 186% read 12% clw threads 5r 0w 0i -- buffers 549481 253929 d 8 w 0 pfs -- Now finalize the state; this makes it permanent. Else the work will be lost on server -- failure, since there was no transaction log CL_EXEC (&#39;checkpoint&#39;); -- See what we got sparql SELECT COUNT (*) FROM &lt;psmoosh&gt; WHERE {?s ?p ?o}; -- This is 2253102 -- Now make the copy where all have the properties of all synonyms. This takes so much -- space we do not insert it as RDF quads, but make a special table for it so that we can -- run some statistics. This saves time. 
INSERT SOFT smoosh_ct (s, p, o) SELECT s, np_p, np_o FROM name_prop, rdf_quad WHERE o = np_name AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ; -- as above, INSERT SOFT so as to ignore duplicates SELECT COUNT (*) FROM smoosh_ct; -- This is 167360324 -- Find out where the bloat comes from SELECT TOP 20 COUNT (*), ID_TO_IRI (p) FROM smoosh_ct GROUP BY p ORDER BY 1 DESC; The results are: 54728777 http://www.w3.org/2002/07/owl#sameAs 48543153 http://xmlns.com/foaf/0.1/knows 13930234 http://www.w3.org/2000/01/rdf-schema#seeAlso 12268512 http://xmlns.com/foaf/0.1/interest 11415867 http://xmlns.com/foaf/0.1/nick 6683963 http://xmlns.com/foaf/0.1/weblog 6650093 http://xmlns.com/foaf/0.1/depiction 4231946 http://xmlns.com/foaf/0.1/mbox_sha1sum 4129629 http://xmlns.com/foaf/0.1/homepage 1776555 http://xmlns.com/foaf/0.1/holdsAccount 1219525 http://xmlns.com/foaf/0.1/based_near 305522 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 274965 http://xmlns.com/foaf/0.1/name 155131 http://xmlns.com/foaf/0.1/dateOfBirth 153001 http://xmlns.com/foaf/0.1/img 111130 http://www.w3.org/2001/vcard-rdf/3.0#ADR 52930 http://xmlns.com/foaf/0.1/gender 48517 http://www.w3.org/2004/02/skos/core#subject 45697 http://www.w3.org/2000/01/rdf-schema#label 44860 http://purl.org/vocab/bio/0.1/olb Now compare with the predicate distribution of the smoosh with identities canonicalized sparql SELECT COUNT (*) ?p FROM &lt;psmoosh&gt; WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY 1 DESC LIMIT 20; Results are: 748311 http://xmlns.com/foaf/0.1/knows 548391 http://xmlns.com/foaf/0.1/interest 140531 http://www.w3.org/2000/01/rdf-schema#seeAlso 105273 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 78497 http://xmlns.com/foaf/0.1/name 48099 http://www.w3.org/2004/02/skos/core#subject 45179 http://xmlns.com/foaf/0.1/depiction 40229 http://www.w3.org/2000/01/rdf-schema#comment 38272 http://www.w3.org/2000/01/rdf-schema#label 37378 http://xmlns.com/foaf/0.1/nick 37186 http://dbpedia.org/property/abstract 
34003 http://xmlns.com/foaf/0.1/img 26182 http://xmlns.com/foaf/0.1/homepage 23795 http://www.w3.org/2002/07/owl#sameAs 17651 http://xmlns.com/foaf/0.1/mbox_sha1sum 17430 http://xmlns.com/foaf/0.1/dateOfBirth 15586 http://xmlns.com/foaf/0.1/page 12869 http://dbpedia.org/property/reference 12497 http://xmlns.com/foaf/0.1/weblog 12329 http://blogs.yandex.ru/schema/foaf/school We can drop the owl:sameAs triples from the count, so the bloat is a bit less by that but it still is tens of times larger than the canonicalized copy or the initial state. Now, when we try using the psmoosh graph, we still get different results from the results with the original data. This is because foaf:knows relations to things with no foaf:name are not represented in the smoosh. They exist: sparql SELECT COUNT (*) WHERE { ?s foaf:knows ?thing . FILTER ( !bif:exists ( SELECT (1) WHERE { ?thing foaf:name ?nn } ) ) }; -- 1393940 So the smoosh graph is not an accurate rendition of the social network. It would have to be smooshed further to be that, since the data in the sample is quite irregular. But we do not go that far here. Finally, we calculate the smoosh blow up factors. We do not include owl:sameAs triples in the counts. select (167360324 - 54728777) / 3284674.0; 34.290022997716059 select 2229307 / 3284674.0; = 0.678699621332284 So, to get a smoosh that is not really the equivalent of the original, either multiply the original triple count by 34 or 0.68, depending on whether synonyms are collapsed or not. Making the smooshes does not take very long, some minutes for the small one. Inserting the big one would be longer, a couple of hours maybe. It was 33 minutes for filling the smoosh_ct table. The metrics were not with optimal tuning so the performance numbers just serve to show that smooshing takes time. Probably more time than allowable in an interactive situation, no matter how the process is optimized.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>What a terrible word, smooshing...  I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other.  This smooshes them together, makes them interchangeable.</p>

<p>This is complex, so I will begin with the point, and the interested may read on for the details and implications.  Starting with the soon-to-be-released version 6, <a href="http://virtuoso.openlinksw.com" id="link-id15718cb8">Virtuoso</a> allows you to say that two things, if they share a uniquely identifying property, are the same.  Examples of uniquely identifying properties would be a book&#39;s ISBN, or a person&#39;s social security number plus full name.  In relational language this is a <i>unique key</i>, and in <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id145ed998">RDF</a> parlance, an <i>inverse functional property</i>.</p>

<p>In most systems, such problems are dealt with as a preprocessing step before querying.  For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules.  This is good if the rules are clear and understood.  This is so in closed situations, where things tend to have standard identifiers to begin with.  But on the open web this is not so clear cut.</p>

<p>In this post, we show how to do these things <i>ad hoc</i>, without materializing anything.  At the end, we also show how to materialize identity and what the consequences of this are with open web <a href="http://dbpedia.org/resource/Data" id="link-id11726358">data</a>.  We use real live web crawls from the <a href="http://challenge.semanticweb.org/" id="link-id14f40448">Billion Triples Challenge</a> data set.</p>

<p>On the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id156e2b10">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id1106ce08">web</a>, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated.  But this is only the beginning of the problems.</p>

<p>To address these, we have added the option of specifying that some property will be considered inversely functional in a query.  This is done at run time and the property does not really have to be inversely functional in the pure sense.  <code>foaf:name</code> will do for an example.  This simply means that for purposes of the query concerned, two subjects which have at least one <code>foaf:name</code> in common are considered the same. In this way, we can join between FOAF files.  With the same database, a query about music preferences might consider having the same name as &quot;same enough,&quot; but a query about criminal prosecution would obviously need to be more precise about sameness.</p>

<p>Our ontology is defined like this:</p>

<blockquote>
<pre>-- Populate a named graph with the triples you want to use in query time inferencing<br />
ttlp ( &#39;
        @prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt; .
        @prefix owl:  &lt;http://www.w3.org/2002/07/owl#&gt; .
        foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty  .
        foaf:name          a  owl:InverseFunctionalProperty  .
       &#39;,
       &#39;xx&#39;,
       &#39;b3sifp&#39;
     );<br />
-- Declare that the graph contains an ontology for use in query time inferencing <br />
rdfs_rule_set ( &#39;http://example.com/rules/b3sifp#&#39;,
                &#39;b3sifp&#39;
              );
</pre></blockquote>

<p>Then use it:</p>

<blockquote>
<pre>sparql 
   DEFINE input:inference &quot;http://example.com/rules/b3sifp#&quot; 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &quot;&#39;Kjetil Kjernsmo&#39;&quot;  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };<br />
VARCHAR                                  VARCHAR                                           VARCHAR
______________________________________   _______________________________________________   ______________________________<br />
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
</pre></blockquote>

<p>Without the inference, we get no matches.  This is because the data in question has one graph per FOAF file, and blank nodes for persons.  No graph references any person outside the ones in the graph.  So if somebody is mentioned as known, then without the inference there is no way to get to what that person&#39;s FOAF file says, since the same individual will be a different blank node there.  The declaration in the context named <code>b3sifp</code> just means that all things with a matching <code>foaf:name</code> or <code>foaf:mbox_sha1sum</code> are the same.</p>

<p>Sameness means that two subjects are the same for purposes of <code>DISTINCT</code> or <code>GROUP BY</code>, and if two are the same, then both have the <code>UNION</code> of all of the properties of both.</p>

<p>If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for <code>DISTINCT</code>.</p>
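
<p>To make this notion of sameness concrete, here is a minimal Python sketch over made-up data.  It groups subjects that share a value of a property declared inverse functional, then gives each group the union of its members&#39; properties.  The blank node labels, the <code>identity_classes</code> helper, and the union-find approach are assumptions for the example; this does not describe how Virtuoso implements the inference internally.</p>

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); all identifiers are made up.
triples = [
    ("_:a1", "foaf:name", "Alice"),
    ("_:a1", "foaf:knows", "_:b1"),
    ("_:a2", "foaf:name", "Alice"),
    ("_:a2", "foaf:nick", "ally"),
    ("_:b1", "foaf:name", "Bob"),
]
# Properties treated as inverse functional for this "query".
IFPS = {"foaf:name", "foaf:mbox_sha1sum"}


def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def identity_classes(triples, ifps):
    """Map each subject to the representative of its identity class."""
    parent = {s: s for s, _, _ in triples}
    first_seen = {}  # (property, value) -> first subject carrying it
    for s, p, o in triples:
        if p in ifps:
            key = (p, o)
            if key in first_seen:
                ra = find(parent, s)
                rb = find(parent, first_seen[key])
                # Merge the two classes; keep the smaller label as root.
                parent[max(ra, rb)] = min(ra, rb)
            else:
                first_seen[key] = s
    return {s: find(parent, s) for s in parent}


classes = identity_classes(triples, IFPS)

# Each identity class gets the union of the properties of its members.
merged = defaultdict(set)
for s, p, o in triples:
    merged[classes[s]].add((p, o))

print(len(set(classes.values())))  # 2 distinct individuals
print(len(merged["_:a1"]))         # 3 properties for the merged Alice
```

<p>In this toy model the two Alice nodes count once for <code>DISTINCT</code>, which is the difference from a naive smoosh that would only copy properties between them.</p>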

<p>If we have complex application rules for determining whether individuals are the same, then one can materialize <code>owl:sameAs</code> triples and keep them in a separate graph.  In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.</p>

<p>The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up.  Best and worst cases will always depend on the data.  In rough terms, the more <i>ad hoc</i> the use, the less desirable the materialization.  If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best.  We can do that too, but so can others.</p>

<p>The principal point is about agility as concerns the inference.  Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble.  When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal.  Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion <a href="http://dbpedia.org/resource/Blog" id="link-id1156a550">blog</a> posts or something of the sort.  So queries will typically be selective.  Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. </p>

<p>The exception is corpus-wide statistics queries.  But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound.  And when things are disk bound, size matters.  Reading extra entailment on the way is just a performance penalty.</p>

<p>Enough talk. Time for an experiment.  We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:</p>

<ol>
<li>Resolve identity at insert time.  We remove duplicate person URIs, and give the single <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id11317008">URI</a> all the properties of all the duplicate URIs.  We expect these to be most often repeats.  If a person references another person, we normalize this reference to go to the single URI of the referenced person.</li>

<li>Give every duplicate URI of a person all the properties of all the duplicates.  If these are the same value, the data should not get much bigger, or so we think.</li>
</ol>

<p>For the experiment, we will consider two people the same if they have the same <code>foaf:name</code> and are both instances of <code>foaf:Person</code>.  This also matches distinct people who merely share a name, but these extra hits should not be statistically significant.</p>

<p>The following is a commented <a href="http://dbpedia.org/resource/SQL" id="link-id110945b0">SQL</a> script performing the smoosh.  We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone.  We use SPARQL where possible for readability.  As the documentation states, <code>iri_to_id</code> converts from the qualified name of an IRI to its ID and <code>id_to_iri</code> does the reverse.</p>
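
<p>As a small illustration of these two functions, the round trip below should give back the IRI it started from.  It is a sketch added for orientation rather than part of the experiment; the <code>foaf:name</code> IRI is simply the one used throughout the script that follows.</p>

<blockquote>
<pre>-- IRI text to internal ID and back again
SELECT ID_TO_IRI (IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;));
</pre></blockquote>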

<p>We count the triples that enter into the smoosh:</p>

<blockquote>
<pre>-- The name is tested with an existence check because otherwise we&#39;d get several times 
-- more rows, since the names occur in many graphs <br />
sparql 
   SELECT COUNT(*) 
    WHERE { { SELECT DISTINCT ?person 
               WHERE { ?person a foaf:Person }
            } . 
            FILTER ( bif:exists ( SELECT (1) 
                                   WHERE { ?person foaf:name ?nn } 
                                )
                       ) . 
            ?person ?p ?o
          };<br />
-- We get 3284674
</pre></blockquote>

<p>We make a few tables for intermediate results.</p>

<blockquote>
<pre>-- For each distinct name, gather the properties and objects from 
-- all subjects with this name <br />
CREATE TABLE name_prop 
   ( np_name  ANY, 
     np_p     IRI_ID_8, 
     np_o     ANY, 
     PRIMARY KEY ( np_name, 
                   np_p, 
                   np_o
                 )
   );
ALTER INDEX name_prop 
   ON name_prop 
   PARTITION ( np_name VARCHAR (-1, 0hexffff) );<br />
-- Map from name to canonical IRI used for the name <br />
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8
                      );
ALTER INDEX name_iri 
   ON name_iri 
   PARTITION ( ni_name VARCHAR (-1, 0hexffff) );<br />
-- Map from person IRI to canonical person IRI<br />
CREATE TABLE pref_iri 
   ( i     IRI_ID_8, 
     pref  IRI_ID_8, 
     PRIMARY KEY ( i )
   );
ALTER INDEX pref_iri 
   ON pref_iri 
   PARTITION ( i INT (0hexffff00) );<br />
-- a table for the materialization where all aliases get all properties of every other <br />
CREATE TABLE smoosh_ct 
   ( s  IRI_ID_8, 
     p  IRI_ID_8, 
     o  ANY, 
     PRIMARY KEY ( s, 
                   p, 
                   o
                 ) 
   );
ALTER INDEX smoosh_ct 
   ON smoosh_ct 
   PARTITION ( s INT (0hexffff00) );<br />
-- disable transaction log and enable row auto-commit.  This is necessary, otherwise 
-- bulk operations are done transactionally and they will run out of rollback space.<br />
LOG_ENABLE (2);<br />
-- Gather all the properties of all persons with a name under that name.  
-- INSERT SOFT means that duplicates are ignored <br />
INSERT SOFT name_prop 
   SELECT &quot;n&quot;, &quot;p&quot;, &quot;o&quot; 
   FROM ( sparql 
          DEFINE output:valmode &quot;LONG&quot; 
          SELECT ?n ?p ?o 
          WHERE { ?x a foaf:Person . 
                 ?x foaf:name ?n . 
                 ?x ?p ?o
               }
        ) xx ;<br />
-- Now choose for each name the canonical IRI <br />
INSERT INTO name_iri 
   SELECT np_name, 
          ( SELECT MIN (s) 
              FROM rdf_quad 
             WHERE o = np_name 
                   AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;)
          ) AS mini 
     FROM name_prop 
    WHERE np_p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- For each person IRI, map to the canonical IRI of that person <br />
INSERT SOFT pref_iri (i, pref) 
   SELECT s, 
          ni_s 
     FROM name_iri, 
          rdf_quad 
    WHERE o = ni_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- Make a graph where all persons have one iri with all the properties of all aliases 
-- and where person-to-person refs are canonicalized<br />
INSERT SOFT rdf_quad (g,s,p,o) 
   SELECT IRI_TO_ID (&#39;psmoosh&#39;), 
          ni_s, 
          np_p, 
          COALESCE ( ( SELECT pref 
                         FROM pref_iri 
                        WHERE i = np_o
                     ), 
                     np_o 
                   )
     FROM name_prop, 
          name_iri 
    WHERE ni_name = np_name 
   OPTION ( loop, quietcast ) ;<br />
-- A little explanation:  The properties of names are copied into rdf_quad with the name 
-- replaced with its canonical IRI.  If the object has a canonical IRI, this is used as 
-- the object, else the object is unmodified.  This is the COALESCE with the sub-query.<br />
-- This takes a little time.  To check on the progress, take another connection to the 
-- server and do <br />
STATUS (&#39;cluster&#39;);<br />
-- It will return something like 
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s  75% cpu 186%  read 12% clw threads 5r 0w 0i 
-- buffers 549481 253929 d 8 w 0 pfs<br />
-- Now finalize the state; this makes it permanent.  Else the work will be lost on server 
-- failure, since there was no transaction log <br />
CL_EXEC (&#39;checkpoint&#39;);<br />
-- See what we got<br />
sparql 
   SELECT COUNT (*) 
     FROM &lt;psmoosh&gt; 
     WHERE {?s ?p ?o};<br />
-- This is 2253102<br />
-- Now make the copy where all have the properties of all synonyms.  This takes so much 
-- space we do not insert it as RDF quads, but make a special table for it so that we can 
-- run some statistics.  This saves time.<br />
INSERT SOFT smoosh_ct (s, p, o)  
   SELECT s, np_p, np_o 
     FROM name_prop, 
          rdf_quad 
    WHERE o = np_name 
          AND p = IRI_TO_ID (&#39;http://xmlns.com/foaf/0.1/name&#39;) ;<br />
-- as above, INSERT SOFT so as to ignore duplicates <br />
SELECT COUNT (*) 
   FROM smoosh_ct;<br />
-- This is  167360324<br />
-- Find out where the bloat comes from <br />
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (p) 
   FROM smoosh_ct 
   GROUP BY p 
   ORDER BY 1 DESC;
</pre></blockquote>
<p>The results are:</p>

<blockquote>
<pre>54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb
</pre></blockquote>

<p>Now compare with the predicate distribution of the smoosh with identities canonicalized:</p>

<blockquote>
<pre>sparql 
     SELECT COUNT (*) ?p 
       FROM &lt;psmoosh&gt; 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;</pre></blockquote>

<p>Results are:</p>
<blockquote>
<pre>748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school
</pre></blockquote>

<p>We can drop the <code>owl:sameAs</code> triples from the count, which reduces the bloat somewhat, but the result is still tens of times larger than either the canonicalized copy or the initial state.</p>
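
<p>A rough way to see where the multiplication comes from is to count how many person IRIs map to each canonical IRI in the <code>pref_iri</code> table built above.  A name shared by many IRIs has its merged property set repeated once per alias in <code>smoosh_ct</code>, but stored only once in the psmoosh graph.  The query below is a sketch added for illustration and was not part of the original run; it assumes <code>pref_iri</code> has been filled as in the script and follows the same pattern as the predicate statistics query.</p>

<blockquote>
<pre>-- Canonical IRIs with the most aliases 
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (pref) 
   FROM pref_iri 
   GROUP BY pref 
   ORDER BY 1 DESC;
</pre></blockquote>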

<p>Now, when we try using the psmoosh graph, we still get different results from the results with the original data.  This is because <code>foaf:knows</code> relations to things with no <code>foaf:name</code> are not represented in the smoosh.  They exist:</p>

<blockquote>
<pre>sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };<br />
-- 1393940
</pre></blockquote>

<p>So the smoosh graph is not an accurate rendition of the social network.  It would have to be smooshed further to be that, since the data in the sample is quite irregular.  But we do not go that far here.</p>

<p>Finally, we calculate the smoosh blow up factors.  We do not include <code>owl:sameAs</code> triples in the counts.</p>

<blockquote>
<pre>SELECT (167360324 - 54728777) / 3284674.0;
-- 34.290022997716059<br />
-- 2229307 = 2253102 - 23795, the psmoosh count less its owl:sameAs triples 
SELECT 2229307 / 3284674.0;
-- 0.678699621332284
</pre></blockquote>

<p>So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 0.68 if synonyms are collapsed into one canonical URI, or by 34 if every synonym is given all the properties of the others.</p>
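
<p>The numerators above were obtained by subtracting the <code>owl:sameAs</code> rows of the two predicate distributions from the totals.  As a sketch, the same counts could be taken directly; these two statements were not part of the original run, and if the figures above hold they should come to 112631547 and 2229307 respectively.</p>

<blockquote>
<pre>-- smoosh_ct without the owl:sameAs rows 
SELECT COUNT (*) 
   FROM smoosh_ct 
  WHERE p &lt;&gt; IRI_TO_ID (&#39;http://www.w3.org/2002/07/owl#sameAs&#39;);
-- psmoosh graph without the owl:sameAs triples 
sparql 
   SELECT COUNT (*) 
     FROM &lt;psmoosh&gt; 
    WHERE { ?s ?p ?o . 
            FILTER ( ?p != &lt;http://www.w3.org/2002/07/owl#sameAs&gt; ) 
          };
</pre></blockquote>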

<p>Making the smooshes does not take very long: some minutes for the small one.  Inserting the big one would take longer, a couple of hours maybe.  Filling the <code>smoosh_ct</code> table took 33 minutes.  The measurements were not made with optimal tuning, so the performance numbers serve only to show that smooshing takes time, probably more than is allowable in an interactive situation, no matter how the process is optimized.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-20#1485">
  <rss:title>Virtuoso Vs. MySQL:  Setting the Berlin Record Straight (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-20T11:06:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of the Berlin SPARQL Benchmark, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the BSBM paper had 3 seconds for MySQL and 100 for Virtuoso, basically through ignoring cache effects. So we decided to do it ourselves. The score is (updated with revised innodb_buffer_pool_size setting, based on advice noted down below): n-clients Virtuoso MySQL (with increased buffer pool size) MySQL (with default buffer pool size) 1 41,161.33 27,023.11 12,171.41 4 127,918.30 (pending) 37,566.82 8 218,162.29 105,524.23 51,104.39 16 214,763.58 98,852.42 47,589.18 The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here. The benchmark is pure SQL, nothing to do with SPARQL or RDF. The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux. The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran an analyze table on each MySQL table but noticed that this had no effect. Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests. The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine -- Virtuoso configuration file MySQL configuration file Table definitions &amp; RDF views Indexes on MySQL tables MySQL ought to do better. 
We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such. Update: Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache. On the first run, we noticed that the Innodb warm up time was somewhere well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run larger scale system for half an hour first and then measure and then measure again. If the second measurement is the same as the first then it is good. As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa322b58">Berlin SPARQL Benchmark</a>, I have repeatedly written about measurement procedures and steady state.  The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup.  Thus it came to pass that one cut of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x9524730">BSBM</a> paper had 3 seconds for <a href="http://dbpedia.org/resource/MySQL" id="link-id0x2ba8db0">MySQL</a> and 100 for <a href="http://virtuoso.openlinksw.com" id="link-id0xa9137d0">Virtuoso</a>, basically through ignoring cache effects.</p>

<p>So we decided to do it ourselves.</p>

<p>The score is (updated with revised <code>innodb_buffer_pool_size</code> setting, based on advice noted down below):</p>

<table border="1" cellspacing="2" cellpadding="5">
<tr>
    <th>n-clients</th>
    <th>Virtuoso</th>
    <th>MySQL <br /> (with increased buffer pool size)</th>
    <th>MySQL <br /> (with default buffer pool size)</th>
  </tr>
<tr align="right">
    <td>1</td>
    <td> 41,161.33</td>
    <td> 27,023.11 </td>
    <td> 12,171.41</td>
  </tr>
<tr align="right">
    <td>4</td>
    <td> 127,918.30</td>
    <td> (pending) </td>
    <td>  37,566.82</td>
  </tr>
<tr align="right">
    <td>8</td>
    <td> 218,162.29 </td>
    <td> 105,524.23 </td>
    <td>  51,104.39 </td>
  </tr>
<tr align="right">
    <td>16</td>
    <td> 214,763.58 </td>
    <td>  98,852.42 </td>
    <td>  47,589.18 </td>
  </tr>
</table>


<p>The metric is the query mixes per hour from the BSBM test driver output.  For the interested, the complete output is <a href="http://www.openlinksw.com/weblog/oerling/texts/bsbmres.txt" id="link-id1119f770">here</a>.</p>

<p>The benchmark is pure <a href="http://dbpedia.org/resource/SQL" id="link-id0x2b61c88">SQL</a>, nothing to do with <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x17a6d408">SPARQL</a> or <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x9a0a968">RDF</a>.</p>

<p>The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM.  The OS is 64-bit Debian Linux.</p>

<p>The benchmark was run at a scale of 200,000.  Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like.  Both databases were configured to use 8G for disk cache.  The test effectively runs from memory.  We ran an analyze table on each MySQL table but noticed that this had no effect.  Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference.  The MySQL tables were served by the InnoDB engine.  MySQL appears to cache results of queries in some cases.  This was not apparent in the tests.</p>

<p>The versions are 5.09 for Virtuoso and 5.1.29 for MySQL.  You can download and examine --</p>
<ul> 
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/virtuoso.ini" id="link-id14fe17f0">Virtuoso configuration file</a>
</li>
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/my.cnf" id="link-id116fe490">MySQL configuration file</a>
</li>
<li>
    <a href="http://www.openlinksw.com/weblog/oerling/texts/create_tables_and_rdf_view.sql" id="link-id14ce9268">Table definitions &amp; RDF views</a> 
</li>
<li> <a href="http://www.openlinksw.com/weblog/oerling/texts/mysqlinx.sql" id="link-id1535e298">Indexes on MySQL tables</a>
</li>
</ul>

<p>
<strike>MySQL ought to do better.  We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL.  It is possible there is a config parameter that affects this.  The public is invited to tell us about such.</strike>
</p>

<p>
<b>Update:</b>
</p>

<p>Andreas Schultz of the BSBM team advised us to increase the <code>innodb_buffer_pool_size</code> setting in the MySQL config.  We did and it produced some improvement.  Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better.  We ran these things a few times so as to have warm cache.</p>

<p>On the first run, we noticed that the InnoDB warm-up time was somewhere well in excess of 2000 query mixes.  Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso.  We recently made a greedy prefetch hack that should give us some mileage there.  For the next BSBM, all we can advise is to run a larger-scale system for half an hour first, then measure, and then measure again.  If the second measurement is the same as the first, then it is good.</p>
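
<p>One possible way to check both the effective setting and how far the InnoDB buffer pool has warmed up is to look at the server variables and status counters from a MySQL client.  The statements below are a sketch added for illustration and were not part of the runs reported here.  Comparing the counters between two measurement rounds gives a rough idea of whether reads are still going to disk.</p>

<blockquote>
<pre>-- Effective buffer pool size, in bytes 
SHOW VARIABLES LIKE &#39;innodb_buffer_pool_size&#39;;
-- Among the rows returned, Innodb_buffer_pool_read_requests counts logical reads 
-- and Innodb_buffer_pool_reads counts the reads that had to go to disk 
SHOW GLOBAL STATUS LIKE &#39;Innodb_buffer_pool_read%&#39;;
</pre></blockquote>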

<p>As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x17a20498">TPC-H</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484">
  <rss:title>Virtuoso Vs. MySQL:  Setting the Berlin Record Straight (update 2)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-20T11:06:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of the Berlin SPARQL Benchmark, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the BSBM paper had 3 seconds for MySQL and 100 for Virtuoso, basically through ignoring cache effects. So we decided to do it ourselves. The score is (updated with revised innodb_buffer_pool_size setting, based on advice noted down below): n-clients Virtuoso MySQL (with increased buffer pool size) MySQL (with default buffer pool size) 1 41,161.33 27,023.11 12,171.41 4 127,918.30 (pending) 37,566.82 8 218,162.29 105,524.23 51,104.39 16 214,763.58 98,852.42 47,589.18 The metric is the query mixes per hour from the BSBM test driver output. For the interested, the complete output is here. The benchmark is pure SQL, nothing to do with SPARQL or RDF. The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM. The OS is 64-bit Debian Linux. The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8G for disk cache. The test effectively runs from memory. We ran an analyze table on each MySQL table but noticed that this had no effect. Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests. The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. You can download and examine -- Virtuoso configuration file MySQL configuration file Table definitions &amp; RDF views Indexes on MySQL tables MySQL ought to do better. 
We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such. Update: Andreas Schultz of the BSBM team advised us to increase the innodb_buffer_pool_size setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran these things a few times so as to have warm cache. On the first run, we noticed that the Innodb warm up time was somewhere well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. For the next BSBM, all we can advise is to run larger scale system for half an hour first and then measure and then measure again. If the second measurement is the same as the first then it is good. As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of TPC-H.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0xa5314d8">Berlin SPARQL Benchmark</a>, I have repeatedly written about measurement procedures and steady state.  The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup.  Thus it came to pass that one cut of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x18482c20">BSBM</a> paper had 3 seconds for <a href="http://dbpedia.org/resource/MySQL" id="link-id0xb8c54de8">MySQL</a> and 100 for <a href="http://virtuoso.openlinksw.com" id="link-id0x189b2210">Virtuoso</a>, basically through ignoring cache effects.</p>

<p>So we decided to do it ourselves.</p>

<p>The score is (updated with revised <code>innodb_buffer_pool_size</code> setting, based on advice noted down below):</p>

<table border="1" cellspacing="2" cellpadding="5">
<tr>
    <th>n-clients</th>
    <th>Virtuoso</th>
    <th>MySQL <br /> (with increased buffer pool size)</th>
    <th>MySQL <br /> (with default buffer pool size)</th>
  </tr>
<tr align="right">
    <td>1</td>
    <td> 41,161.33</td>
    <td> 27,023.11 </td>
    <td> 12,171.41</td>
  </tr>
<tr align="right">
    <td>4</td>
    <td> 127,918.30</td>
    <td> (pending) </td>
    <td>  37,566.82</td>
  </tr>
<tr align="right">
    <td>8</td>
    <td> 218,162.29 </td>
    <td> 105,524.23 </td>
    <td>  51,104.39 </td>
  </tr>
<tr align="right">
    <td>16</td>
    <td> 214,763.58 </td>
    <td>  98,852.42 </td>
    <td>  47,589.18 </td>
  </tr>
</table>


<p>The metric is the query mixes per hour from the BSBM test driver output.  For the interested, the complete output is <a href="http://www.openlinksw.com/weblog/oerling/texts/bsbmres.txt" id="link-id1119f770">here</a>.</p>

<p>The benchmark is pure <a href="http://dbpedia.org/resource/SQL" id="link-id0x5257718">SQL</a>, nothing to do with <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb8c463e0">SPARQL</a> or <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x16e68d50">RDF</a>.</p>

<p>The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 G RAM.  The OS is 64-bit Debian Linux.</p>

<p>The benchmark was run at a scale of 200,000.  Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like.  Both databases were configured to use 8G for disk cache.  The test effectively runs from memory.  We ran an analyze table on each MySQL table but noticed that this had no effect.  Virtuoso does the stats sampling on the go; possibly MySQL also since the explicit stats did not make any difference.  The MySQL tables were served by the InnoDB engine.  MySQL appears to cache results of queries in some cases.  This was not apparent in the tests.</p>

<p>The versions are 5.09 for Virtuoso and 5.1.29 for MySQL.  You can download and examine --</p>
<ul> 
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/virtuoso.ini" id="link-id14fe17f0">Virtuoso configuration file</a>
</li>
<li>
<a href="http://www.openlinksw.com/weblog/oerling/texts/my.cnf" id="link-id116fe490">MySQL configuration file</a>
</li>
<li>
    <a href="http://www.openlinksw.com/weblog/oerling/texts/create_tables_and_rdf_view.sql" id="link-id14ce9268">Table definitions &amp; RDF views</a> 
</li>
<li> <a href="http://www.openlinksw.com/weblog/oerling/texts/mysqlinx.sql" id="link-id1535e298">Indexes on MySQL tables</a>
</li>
</ul>

<p>
<strike>MySQL ought to do better.  We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL.  It is possible there is a config parameter that affects this.  The public is invited to tell us about such.</strike>
</p>

<p>
<b>Update:</b>
</p>

<p>Andreas Schultz of the BSBM team advised us to increase the <code>innodb_buffer_pool_size</code> setting in the MySQL config.  We did and it produced some improvement.  Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better.  We ran these things a few times so as to have warm cache.</p>

<p>On the first run, we noticed that the InnoDB warm-up time was somewhere well in excess of 2000 query mixes.  Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso.  We recently made a greedy prefetch hack that should give us some mileage there.  For the next BSBM, all we can advise is to run a larger-scale system for half an hour first, then measure, and then measure again.  If the second measurement is the same as the first, then it is good.</p>
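
<p>One possible way to check both the effective setting and how far the InnoDB buffer pool has warmed up is to look at the server variables and status counters from a MySQL client.  The statements below are a sketch added for illustration and were not part of the runs reported here.  Comparing the counters between two measurement rounds gives a rough idea of whether reads are still going to disk.</p>

<blockquote>
<pre>-- Effective buffer pool size, in bytes 
SHOW VARIABLES LIKE &#39;innodb_buffer_pool_size&#39;;
-- Among the rows returned, Innodb_buffer_pool_read_requests counts logical reads 
-- and Innodb_buffer_pool_reads counts the reads that had to go to disk 
SHOW GLOBAL STATUS LIKE &#39;Innodb_buffer_pool_read%&#39;;
</pre></blockquote>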

<p>As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x122eaa00">TPC-H</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1481">
  <rss:title>ISWC 2008: Some Questions</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T15:54:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Inference: Is it always forward chaining? We got a number of questions about Virtuoso&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment. Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine. For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed. We had a word with Ian Horrocks on the question. He noted that it is often naive on behalf of the community to tend to equate description of semantics with description of algorithm. The data need not always be blown up. The advantage of not always materializing is that the working set stays better. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. 
This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailment subclasses and sub-properties are materialized rather than done at run time. Genetic Algorithms for SPARQL? Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al, for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us. Schema Last? It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge. There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware. It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Inference: Is it always forward chaining?</h2>

<p>We got a number of questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x13c64b60">Virtuoso</a>&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.</p>

<p>Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x14d17778">SPARQL</a>, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x128e55d0">owl</a>:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.</p>

<p>For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our <a href="http://dbpedia.org/resource/SQL" id="link-id0x12614770">SQL</a>/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this remains to be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.</p>

<p>We had a word with <a href="http://web.comlab.ox.ac.uk/people/Ian.Horrocks/" id="link-id117c99d0">Ian Horrocks</a> on the question. He noted that the community is often naive in equating a description of semantics with a description of an algorithm. The <a href="http://dbpedia.org/resource/Data" id="link-id0x145b2980">data</a> need not always be blown up.</p>

<p>The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of effects that are hard to retract. On the other hand, when running in memory with static data such as the LUBM benchmark, queries run some 20% faster if subclass and sub-property entailment is materialized rather than resolved at run time.</p>

<h2>Genetic Algorithms for SPARQL?</h2>

<p>Our compliments for the wildest idea of the conference go to <a href="http://www.eyaloren.org/" id="link-id1a203af8">Eyal Oren</a>, <a href="http://www.few.vu.nl/~cgueret/" id="link-id16208758">Christophe Guéret</a>, and <a href="http://www.few.vu.nl/~schlobac/" id="link-id111923e0">Stefan Schlobach</a>, <i>et al</i>, for their <a href="http://www.informatik.uni-trier.de/~ley/db/conf/semweb/iswc2008.html#OrenGS08" id="link-id11793540">paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated</a>. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us.</p>

<h2>Schema Last?</h2>

<p>It is interesting to see how the industry comes to the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x12b57e90">semantic web</a> conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making logics that perform more predictably and are friendlier to databases. So the extremes converge.</p>

<p>There is a point to schema last. <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x12a8ff48">RDF</a> is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high-traffic online application would always benefit from having some application-specific data structures. These could also save considerably on hardware.</p>

<p>It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479">
  <rss:title>ISWC 2008: Some Questions</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T15:54:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Inference: Is it always forward chaining? We got a number of questions about Virtuoso&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment. Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to SPARQL, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and owl:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine. For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our SQL/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed. We had a word with Ian Horrocks on the question. He noted that it is often naive on behalf of the community to tend to equate description of semantics with description of algorithm. The data need not always be blown up. The advantage of not always materializing is that the working set stays better. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. 
This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if entailment subclasses and sub-properties are materialized rather than done at run time. Genetic Algorithms for SPARQL? Our compliments for the wildest idea of the conference go to Eyal Oren, Christophe Guéret, and Stefan Schlobach, et al, for their paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us. Schema Last? It is interesting to see how the industry comes to the semantic web conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge. There is a point to schema last. RDF is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware. It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Inference: Is it always forward chaining?</h2>

<p>We got a number of questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x131604a8">Virtuoso</a>&#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.</p>

<p>Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1422f910">SPARQL</a>, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x145894a8">owl</a>:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.</p>

<p>For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our <a href="http://dbpedia.org/resource/SQL" id="link-id0x1458a288">SQL</a>/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this remains to be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.</p>

<p>We had a word with <a href="http://web.comlab.ox.ac.uk/people/Ian.Horrocks/" id="link-id117c99d0">Ian Horrocks</a> on the question. He noted that the community is often naive in equating a description of semantics with a description of an algorithm. The <a href="http://dbpedia.org/resource/Data" id="link-id0x14cf0b18">data</a> need not always be blown up.</p>

<p>The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement: it can lead to a lot of effects that are hard to retract. On the other hand, when running in memory with static data such as the LUBM benchmark, queries run some 20% faster if subclass and sub-property entailment is materialized rather than resolved at run time.</p>
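
<p>To make the trade-off concrete, here is a minimal Python sketch, not Virtuoso code. The toy triples, the class names, and the helper functions <code>superclasses</code>, <code>materialize</code>, and <code>instances_at_run_time</code> are all hypothetical. It answers the same subclass query once over a materialized (forward-chained) copy of the data and once by expanding the class hierarchy at query time (backward chaining).</p>

<pre>
```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = {
    ("GraduateStudent", "subClassOf", "Student"),
    ("Student", "subClassOf", "Person"),
    ("alice", "type", "GraduateStudent"),
    ("bob", "type", "Student"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable via subClassOf."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "subClassOf":
            parents[s].add(o)
    seen, stack = {cls}, [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def materialize(triples):
    """Forward chaining: store every entailed type triple up front."""
    entailed = set(triples)
    for s, p, o in triples:
        if p == "type":
            for cls in superclasses(o, triples):
                entailed.add((s, "type", cls))
    return entailed


def instances_at_run_time(cls, triples):
    """Backward chaining: expand the class at query time, store nothing."""
    return {
        s for s, p, o in triples
        if p == "type" and cls in superclasses(o, triples)
    }


stored = materialize(TRIPLES)
forward = {s for s, p, o in stored if p == "type" and o == "Person"}
backward = instances_at_run_time("Person", TRIPLES)
print(sorted(forward), sorted(backward), len(TRIPLES), len(stored))
```
</pre>

<p>Both approaches return the same instances. In this toy set, materialization stores seven triples instead of four, which is the working-set growth discussed above, while the run-time variant repeats the hierarchy walk on every query.</p>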

<h2>Genetic Algorithms for SPARQL?</h2>

<p>Our compliments for the wildest idea of the conference go to <a href="http://www.eyaloren.org/" id="link-id1a203af8">Eyal Oren</a>, <a href="http://www.few.vu.nl/~cgueret/" id="link-id16208758">Christophe Guéret</a>, and <a href="http://www.few.vu.nl/~schlobac/" id="link-id111923e0">Stefan Schlobach</a>, <i>et al</i>, for their <a href="http://www.informatik.uni-trier.de/~ley/db/conf/semweb/iswc2008.html#OrenGS08" id="link-id11793540">paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated</a>. Prisoners of our &quot;conventional wisdom&quot; as we are, this might never have occurred to us.</p>

<h2>Schema Last?</h2>

<p>It is interesting to see how the industry comes to the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1154c1b0">semantic web</a> conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making logics that perform more predictably and are friendlier to databases. So the extremes converge.</p>

<p>There is a point to schema last. <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x14c6a930">RDF</a> is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high-traffic online application would always benefit from having some application-specific data structures. These could also save considerably on hardware.</p>

<p>It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1480">
  <rss:title>ISWC 2008: Billion Triples Challenge</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T15:52:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We showed our billion triples demo at the ISWC 2008 poster session. Generally people liked what they saw, as we basically did what one always had wanted to do with SPARQL but never could. This means firstly full SQL parity, with sub-queries, aggregation, full text, etc. Beyond SQL, we have transitive sub-queries, owl:sameAs at run time, and other inference things, all on demand. The live demo is at http://b3s.openlinksw.com/. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole LOD cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter. In the words of Jim Hendler, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this since if the event becomes like the TPC benchmarks, this will limit the entrance to full time database players. Anyway, we got a special mention in the intro of the challenge track. The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the DARQ Matter of Federation post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see. Why Not Just Join? The entries from Deri and LARKC (MaRVIN, &quot;Massive RDF Versatile Inference Network&quot;) were doing materialization of inference results in a cluster environment. The thing they were not doing was joining across partitions. Thus, the data was partitioned on whatever criterion and then the data in each partition was further refined according to rules known to all partitions. Deri did not address joining further. 
&quot;Nature shall be the guide of the alchemist,&quot; goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for information is expensive and subject to misunderstanding; asking another research group might never produce an answer. Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat latency point to point connectivity: some joins are fast; others are not even tried, for all we know. A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time. MaRVIN&#39;s approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things. The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes. Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids. We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. 
If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins. The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain knowledge is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one&#39;s perceptions: Again, it is like nature, sometimes not all cards are on the table. It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question. The question is, at what point will a cluster using distributed database operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a DISTINCT operation requires that data passes through a single place but JOINing and aggregation in general do not. We will still publish numbers during this November.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We showed our billion triples demo at the <a href="http://iswc2008.semanticweb.org/" id="link-id0x14898200">ISWC 2008</a> poster session. Generally people liked what they saw, as we basically did what one had always wanted to do with <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x12c56820">SPARQL</a> but never could. This means firstly full <a href="http://dbpedia.org/resource/SQL" id="link-id0x13a86e38">SQL</a> parity, with sub-queries, aggregation, full text, etc. Beyond SQL, we have transitive sub-queries, <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x12842500">owl</a>:sameAs at run time, and other inference features, all on demand.</p>

<p>The live demo is at <a href="http://b3s.openlinksw.com/" id="link-id14ba36e0">http://b3s.openlinksw.com/</a>. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x14329f58">LOD</a> cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter.</p>

<p>In the words of <a href="http://www.cs.rpi.edu/~hendler/" id="link-id111ad740">Jim Hendler</a>, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this since if the event becomes like the TPC benchmarks, this will limit entry to full-time database players. Anyway, we got a special mention in the intro of the challenge track.</p>

<p>The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the <i><a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id1831cce0">DARQ Matter of Federation</a></i> post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see.</p>

<h2>Why Not Just Join?</h2>

<p>The entries from Deri and LARKC (<a href="http://www.larkc.eu/marvin/" id="link-id1bb42778">MaRVIN</a>, &quot;Massive <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id19c15d30">RDF</a> Versatile Inference Network&quot;) were doing materialization of inference results in a cluster environment. The thing they were not doing was joining across partitions. Thus, the <a href="http://dbpedia.org/resource/Data" id="link-id0x147fb970">data</a> was partitioned on whatever criterion and then the data in each partition was further refined according to rules known to all partitions. Deri did not address joining further.</p>

<p>&quot;Nature shall be the guide of the alchemist,&quot; goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for <a href="http://dbpedia.org/resource/Information" id="link-id0x14c02be8">information</a> is expensive and subject to misunderstanding; asking another research group might never produce an answer.</p>

<p>Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat latency point to point connectivity: some joins are fast; others are not even tried, for all we know.</p>

<p>A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time.</p>

<p>MaRVIN&#39;s approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things.</p>

<p>The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes.</p>

<p>Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids.</p>

<p>We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins.</p>

<p>The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x14a9f8c0">knowledge</a> is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one&#39;s perceptions: Again, it is like nature, sometimes not all cards are on the table.</p>

<p>It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question.</p>

<p>The question is, at what point will a cluster using <a href="http://dbpedia.org/resource/federated_database_system" id="link-id0x12b31cf0">distributed database</a> operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a <code>DISTINCT</code> operation requires that data passes through a single place but <code>JOIN</code>ing and aggregation in general do not.</p>
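
<p>The difference between the two message patterns can be illustrated with a toy count of how much work lands on the busiest node. This is only a sketch under stated assumptions: the cluster size, the number of join steps, the modulo partitioning in <code>owner</code>, and the functions <code>via_coordinator</code> and <code>direct</code> are hypothetical and do not describe Virtuoso's cluster engine or its measured throughput.</p>

<pre>
```python
from collections import Counter

NODES = 4          # hypothetical cluster size
JOIN_STEPS = 3     # hypothetical number of join steps per result
RESULTS = 1000     # hypothetical number of intermediate results


def owner(key):
    """Toy partitioning: the node that holds the data for this key."""
    return key % NODES


def via_coordinator():
    """Worst case: node 0 dispatches every join step and gets it back."""
    handled = Counter()
    for r in range(RESULTS):
        for step in range(JOIN_STEPS):
            handled[0] += 1                  # coordinator relays the result
            handled[owner(r + step)] += 1    # owning node does the lookup
    return handled


def direct():
    """Best case: each node forwards the result straight to the next owner."""
    handled = Counter()
    for r in range(RESULTS):
        for step in range(JOIN_STEPS):
            handled[owner(r + step)] += 1
    return handled


coordinated, routed = via_coordinator(), direct()
print(max(coordinated.values()), max(routed.values()))
```
</pre>

<p>With these made-up numbers the lookups are spread evenly in both cases, but the coordinating node also has to touch every intermediate result, so it carries several times the load of any other node. Adding nodes does not relieve it, whereas in the direct pattern the per-node load shrinks as nodes are added.</p>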

<p>We will publish numbers before the end of this November.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1478">
  <rss:title>ISWC 2008: Billion Triples Challenge</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-11-04T15:52:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We showed our billion triples demo at the ISWC 2008 poster session. Generally people liked what they saw, as we basically did what one always had wanted to do with SPARQL but never could. This means firstly full SQL parity, with sub-queries, aggregation, full text, etc. Beyond SQL, we have transitive sub-queries, owl:sameAs at run time, and other inference things, all on demand. The live demo is at http://b3s.openlinksw.com/. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole LOD cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter. In the words of Jim Hendler, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this since if the event becomes like the TPC benchmarks, this will limit the entrance to full time database players. Anyway, we got a special mention in the intro of the challenge track. The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the DARQ Matter of Federation post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see. Why Not Just Join? The entries from Deri and LARKC (MaRVIN, &quot;Massive RDF Versatile Inference Network&quot;) were doing materialization of inference results in a cluster environment. The thing they were not doing was joining across partitions. Thus, the data was partitioned on whatever criterion and then the data in each partition was further refined according to rules known to all partitions. Deri did not address joining further. 
&quot;Nature shall be the guide of the alchemist,&quot; goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for information is expensive and subject to misunderstanding; asking another research group might never produce an answer. Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat latency point to point connectivity: some joins are fast; others are not even tried, for all we know. A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time. MaRVIN&#39;s approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things. The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes. Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids. We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. 
If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins. The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain knowledge is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one&#39;s perceptions: Again, it is like nature, sometimes not all cards are on the table. It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question. The question is, at what point will a cluster using distributed database operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a DISTINCT operation requires that data passes through a single place but JOINing and aggregation in general do not. We will still publish numbers during this November.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We showed our billion triples demo at the <a href="http://iswc2008.semanticweb.org/" id="link-id0x13a0a520">ISWC 2008</a> poster session. Generally people liked what they saw, as we basically did what one always had wanted to do with <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x138f5798">SPARQL</a> but never could. This means firstly full <a href="http://dbpedia.org/resource/SQL" id="link-id0x1264a688">SQL</a> parity, with sub-queries, aggregation, full text, etc. Beyond SQL, we have transitive sub-queries, <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x1e084138">owl</a>:sameAs at run time, and other inference things, all on demand.</p>

<p>The live demo is at <a href="http://b3s.openlinksw.com/" id="link-id14ba36e0">http://b3s.openlinksw.com/</a>. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x13355f80">LOD</a> cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter.</p>

<p>In the words of <a href="http://www.cs.rpi.edu/~hendler/" id="link-id111ad740">Jim Hendler</a>, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this since if the event becomes like the TPC benchmarks, this will limit the entrance to full time database players. Anyway, we got a special mention in the intro of the challenge track.</p>

<p>The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the <i><a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id1831cce0">DARQ Matter of Federation</a></i> post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see.</p>

<h2>Why Not Just Join?</h2>

<p>The entries from DERI and LarKC (<a href="http://www.larkc.eu/marvin/" id="link-id1bb42778">MaRVIN</a>, &quot;Massive <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id19c15d30">RDF</a> Versatile Inference Network&quot;) were doing materialization of inference results in a cluster environment. The thing they were not doing was joining across partitions. Thus, the <a href="http://dbpedia.org/resource/Data" id="link-id0x1d3c1a38">data</a> was partitioned on whatever criterion and then the data in each partition was further refined according to rules known to all partitions. DERI did not address joining further.</p>

<p>&quot;Nature shall be the guide of the alchemist,&quot; goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for <a href="http://dbpedia.org/resource/Information" id="link-id0x125dd698">information</a> is expensive and subject to misunderstanding; asking another research group might never produce an answer.</p>

<p>Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat-latency, point-to-point connectivity: some joins are fast; others are not even tried, for all we know.</p>

<p>A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time.</p>

<p>MaRVIN&#39;s approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things.</p>

<p>The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes.</p>

<p>Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids.</p>

<p>We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins.</p>
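<p>To make the database-query route concrete, here is a minimal Python sketch of the join with its transitive step. The taxonomy, limits, and product rows are made-up toy data, and the function names are hypothetical; a real deployment would express this as a query against the store rather than as in-memory loops.</p>

```python
# Hypothetical toy data; none of these names come from a real dataset.
SUBCLASS_OF = {  # ingredient taxonomy: child -> parent
    "lead_acetate": "lead_compound",
    "lead_compound": "heavy_metal",
    "carmine": "colorant",
}

# Assumed maximum tolerated concentration (percent by weight) per class.
TOXIC_LIMIT = {"heavy_metal": 0.01}

# (product, ingredient, concentration in percent by weight)
CONTAINS = [
    ("lipstick_a", "lead_acetate", 0.05),
    ("lipstick_a", "carmine", 2.0),
    ("eyeliner_b", "lead_acetate", 0.001),
]


def ancestors(term):
    """Yield the term and all its superclasses (the transitive step)."""
    seen = set()
    while term is not None and term not in seen:
        seen.add(term)
        yield term
        term = SUBCLASS_OF.get(term)


def hazardous_products():
    """Join CONTAINS with TOXIC_LIMIT through the taxonomy closure."""
    hits = []
    for product, ingredient, amount in CONTAINS:
        for cls in ancestors(ingredient):
            limit = TOXIC_LIMIT.get(cls)
            if limit is not None and amount > limit:
                hits.append((product, ingredient, cls))
    return hits


if __name__ == "__main__":
    for hit in hazardous_products():
        print(hit)
```

<p>Each extra rule in the hazard logic adds another lookup inside the inner loop, which is where the "lot of joins" comes from once the data is spread over partitions.</p>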

<p>The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x133b84b8">knowledge</a> is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one&#39;s perceptions: Again, it is like nature, sometimes not all cards are on the table.</p>

<p>It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question.</p>

<p>The question is, at what point will a cluster using <a href="http://dbpedia.org/resource/federated_database_system" id="link-id0x1466c1c0">distributed database</a> operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a <code>DISTINCT</code> operation requires that data passes through a single place but <code>JOIN</code>ing and aggregation in general do not.</p>
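<p>The difference between the two message patterns can be illustrated with a small counting model in Python. This is only a sketch under stated assumptions: the placement rule <code>partition_of</code>, the unit message costs, and the sample keys are all hypothetical, and nothing here reflects measured numbers or the actual cluster protocol.</p>

```python
def partition_of(key, n_nodes):
    """Hypothetical placement rule: hash-partition on the join key."""
    # Sum of bytes is deterministic across runs, unlike the built-in hash().
    return sum(key.encode()) % n_nodes


def count_messages(paths, n_nodes):
    """Count messages for two routing patterns.

    paths: list of key sequences; each sequence is the chain of lookups
    that one intermediate result goes through.
    Returns (via_coordinator, direct).
    """
    via_coordinator = 0  # every message is handled by the coordinator
    direct = 0           # messages travel between whichever nodes own keys
    for keys in paths:
        # Worst case: one request and one reply per lookup, all via one node.
        via_coordinator += 2 * len(keys)
        # Best case: initial dispatch, a forward only when the owner
        # changes between consecutive steps, and one final reply.
        owners = [partition_of(k, n_nodes) for k in keys]
        hops = sum(1 for a, b in zip(owners, owners[1:]) if a != b)
        direct += 1 + hops + 1
    return via_coordinator, direct


if __name__ == "__main__":
    sample_paths = [["a", "b", "c"], ["a", "e"]]
    print(count_messages(sample_paths, n_nodes=4))
```

<p>In this model the coordinator pattern funnels every message through one node, so that node's capacity bounds the throughput, whereas the direct pattern spreads the forwards across the cluster.</p>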

<p>We still expect to publish numbers this November.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1467">
  <rss:title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:15:35Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect? I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times. I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer). &quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said Sören Auer. Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept. This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store. We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there. The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing. 
There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish). Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering. The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted. This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam? The world out there is complex. 
But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform? &quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot; We should answer in multiple parts. For general collateral, like web sites and documentation: The web site gives a confused product image. For the Virtuoso product, we should divide at the top into Data web and RDF - Host linked data, expose relational assets as linked data; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration; Web Services - access all the above over standard protocols, dynamic web pages, web hosting. For each point, one simple statement. We all know what the above things mean? Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this much more requests per second. This too is clear. Far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this. Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. 
About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either. Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves. I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to. &quot;Virtuoso is very big and very difficult&quot; One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files. This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required. Now for the difficult side: Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: PHP or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI. 
Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered. We should put this into a couple of use case oriented how-tos. Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly. &quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot; Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed. If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply. We are talking about such things with different parties at present. &quot;How webby are you? 
What is webby?&quot; &quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot; We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open. Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud. &quot;How does OpenLink see the next five years unfolding?&quot; Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. 
Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL. The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood. Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point. Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these. With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertizing, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy. For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. 
The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have. This makes some things possible that were hard thus far. On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth. Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community. Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more. Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1b4a25f0">semantic web</a> ought to be, but is our own perfect?</p>

<p>I will here engage in some critical introspection as well as amplify on some answers given to <a href="http://virtuoso.openlinksw.com" id="link-id0x1e4f9928">Virtuoso</a>-related questions in recent times.</p>

<p>I use some conversations from the <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1e6c0ca8">Vienna</a> <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1e56df88">Linked Data</a> Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x1e680440">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x1e140068">OpenLink Data Spaces</a>) applications line, <a href="http://oat.openlinksw.com/" id="link-id0x1f4ba630">OAT</a> (<a href="http://oat.openlinksw.com/" id="link-id0x1ba4bac8">OpenLink Ajax Toolkit</a>), or <a href="http://ode.openlinksw.com/" id="link-id0x1d4159b0">ODE</a> (<a href="http://ode.openlinksw.com/" id="link-id0x1e973c80">OpenLink Data Explorer</a>).</p>

<h3>&quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x1f8bafe0">Sören Auer</a>.</h3>

<p>Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.</p>

<p>This is why we put a lot of emphasis on Linked Data and the <a href="http://dbpedia.org/resource/Data" id="link-id0x200bd1f0">Data</a> Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1fb03528">RDF</a> store.</p>

<p>We can do this because we own our database and <a href="http://dbpedia.org/resource/SQL" id="link-id0x1e7dcc70">SQL</a> and data access middleware and have a history of connecting to any <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1e9baf18">RDBMS</a> out there.</p>

<p>The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.</p>

<p>There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1f5f6b78">SPARQL</a> options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced <a href="http://sourceforge.net/projects/sesame/" id="link-id0x206818c8">Sesame</a>- and <a href="http://jena.sourceforge.net/" id="link-id0x202b3348">Jena</a>-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).</p>

<p>Now, this message could be better reflected in our material on the web. This <a href="http://dbpedia.org/resource/Blog" id="link-id0x1c82e508">blog</a> is a rather informal step in this direction; more is to come. For now we concentrate on delivering.</p>

<p>The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.</p>

<p>This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x20832510">TPC-H</a> database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?</p>

<p>The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?</p>

<h3>&quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot;</h3>

<p>We should answer in multiple parts.</p>

<p>For general collateral, like web sites and documentation:</p>

<p>The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into</p>

<ul>  
<li> Data web and RDF - Host linked data, expose relational assets as linked data;</li>
<li> Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;</li>
<li> Web Services - access all the above over standard protocols, dynamic web pages, web hosting.</li>
</ul>

<p>For each point, one simple statement.  We all know what the above things mean?</p>

<p>Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or, in some cases, this many more requests per second. This too is clear.</p>

<p>As far as I am concerned, hosting Java or .<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x20283a88">NET</a> does not have to be on the front page. Also, we have no great interest in going against <a href="http://dbpedia.org/resource/Apache" id="link-id0x2024a068">Apache</a> when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.</p>

<p>Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.</p>

<p>Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.</p>

<p>I think that the communications failure in this case is that we want to talk about end-to-end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about the query cost model alone: the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to the cost model, as these are important to the whole, but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.</p>

<h3>&quot;Virtuoso is very big and very difficult&quot;</h3>

<p>One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.</p>

<p>This gives you SQL and SPARQL out of the box.  Adding <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x1ee61058">ODBC</a> and <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1b8c31c0">JDBC</a> clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.</p>

<p>Now for the difficult side:</p>

<p>Use SQL and SPARQL; use stored procedures whenever there is server-side business logic. For some time-critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: <a href="http://dbpedia.org/resource/PHP" id="link-id0x20a13c00">PHP</a> or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than the admin UI.</p>

<p>Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, as produced by the installers, is cluttered.</p>

<p>We should put this into a couple of use case oriented how-tos.</p>

<p>Also, we should create a network of &quot;friendly local Virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.</p>

<h3>&quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot;</h3>

<p>Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space-efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally included by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants, these are not needed.</p>

<p>If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.</p>

<p>We are talking about such things with different parties at present.</p>

<h3>&quot;How webby are you?  What is webby?&quot;</h3>

<p>&quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot;</p>

<p>We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant <i>when</i> it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.</p>

<p>Google-style crawling of everything becomes less practical if one needs to run complex <i>ad hoc</i> queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.</p>

<h3>&quot;How does OpenLink see the next five years unfolding?&quot;</h3>

<p>Personally, I think we have the basics for the birth of a new inflection in the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1fb9ae58">knowledge</a> economy. The <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x1f07c648">URI</a> is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more <a href="http://dbpedia.org/resource/Information" id="link-id0x1f007d60">information</a> can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.</p>

<p>The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is <i>context</i>. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.</p>

<p>Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies, dereferenceability, and discoverability are important. Without these, we are at best dealing with closed systems, even if they are smart. The expert systems of the 1980s are a case in point.</p>

<p>Ever since the .com era, the <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id0x2048e670">URL</a> has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.</p>

<p>With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.</p>
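
<p>As a hedged illustration of what a canned SPARQL query can look like in practice, here is a minimal Python sketch. It only builds the request URL that the SPARQL Protocol uses for a query sent over HTTP GET, where the query text travels in the <code>query</code> parameter. The end point address, the query itself, and the function name are assumptions made for this example; they are not taken from any OpenLink product.</p>

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical end point address, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# A "canned" query: fixed text, written once, run on demand.
# This placeholder simply lists ten arbitrary triples.
CANNED_QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def canned_query_url(endpoint, query):
    """Return a SPARQL Protocol GET URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = canned_query_url(ENDPOINT, CANNED_QUERY)

# Decoding the URL again gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded)
```

<p>Fetching such a URL from a real end point would return the result set. A value-add product of the kind described here would wrap a more interesting query, one that combines several data sets, in the same way.</p>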

<p>For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1c696170">Linked Open Data</a> cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.</p>

<p>This makes some things possible that were hard thus far.</p>

<p>On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.</p>

<p>Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x206ab780">Sponger</a>, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.</p>

<p>Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.</p>

<p>Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results once things quiet down a little after the ISWC event. Then there is ongoing maintenance, but here we have shown a steady turnaround and quick time to fix for pretty much anything.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465">
  <rss:title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:15:35Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect? I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times. I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer). &quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said Sören Auer. Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept. This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store. We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there. The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing. 
There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish). Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering. The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted. This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam? The world out there is complex. 
But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform? &quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot; We should answer in multiple parts. For general collateral, like web sites and documentation: The web site gives a confused product image. For the Virtuoso product, we should divide at the top into Data web and RDF - Host linked data, expose relational assets as linked data; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration; Web Services - access all the above over standard protocols, dynamic web pages, web hosting. For each point, one simple statement. We all know what the above things mean? Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this much more requests per second. This too is clear. Far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web server only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this. Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. 
About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either. Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves. I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to. &quot;Virtuoso is very big and very difficult&quot; One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files. This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required. Now for the difficult side: Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: PHP or Java or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than admin UI. 
Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers is cluttered. We should put this into a couple of use case oriented how-tos. Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly. &quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot; Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed. If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply. We are talking about such things with different parties at present. &quot;How webby are you? 
What is webby?&quot; &quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot; We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open. Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud. &quot;How does OpenLink see the next five years unfolding?&quot; Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. 
Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL. The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood. Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point. Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these. With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertizing, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy. For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. 
The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have. This makes some things possible that were hard thus far. On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth. Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community. Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more. Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>&quot;Physician, heal thyself,&quot; it is said. We profess to say what the messaging of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1fa3da18">semantic web</a> ought to be, but is our own perfect?</p>

<p>I will here engage in some critical introspection as well as amplify on some answers given to <a href="http://virtuoso.openlinksw.com" id="link-id0x1e1eecf0">Virtuoso</a>-related questions in recent times.</p>

<p>I use some conversations from the <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1ec0b2e0">Vienna</a> <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2045ac10">Linked Data</a> Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x2045ac38">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id0x14f63c58">OpenLink Data Spaces</a>) applications line, <a href="http://oat.openlinksw.com/" id="link-id0x14f63c80">OAT</a> (<a href="http://oat.openlinksw.com/" id="link-id0x1e536928">OpenLink Ajax Toolkit</a>), or <a href="http://ode.openlinksw.com/" id="link-id0x1eaed7f8">ODE</a> (<a href="http://ode.openlinksw.com/" id="link-id0x1edfff88">OpenLink Data Explorer</a>).</p>

<h3>&quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&quot; said <a href="http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me" id="link-id0x1b8a9580">Sören Auer</a>.</h3>

<p>Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.</p>

<p>This is why we put a lot of emphasis on Linked Data and the <a href="http://dbpedia.org/resource/Data" id="link-id0x1b85fa38">Data</a> Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f5b4468">RDF</a> store.</p>

<p>We can do this because we own our database and <a href="http://dbpedia.org/resource/SQL" id="link-id0x20076468">SQL</a> and data access middleware and have a history of connecting to any <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1ffd6f98">RDBMS</a> out there.</p>

<p>The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.</p>

<p>There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1316f720">SPARQL</a> options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced <a href="http://sourceforge.net/projects/sesame/" id="link-id0x20756a68">Sesame</a>- and <a href="http://jena.sourceforge.net/" id="link-id0x1ec01ac0">Jena</a>-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).</p>

<p>Now, this message could be better reflected in our material on the web. This <a href="http://dbpedia.org/resource/Blog" id="link-id0x2027b410">blog</a> is a rather informal step in this direction; more is to come. For now we concentrate on delivering.</p>

<p>The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.</p>

<p>This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x2051ff98">TPC-H</a> database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?</p>

<p>The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?</p>

<h3>&quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&quot;</h3>

<p>We should answer in multiple  parts.</p>

<p>For general collateral, like web sites and documentation:</p>

<p>The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into</p>

<ul>  
<li> Data web and RDF - Host linked data, expose relational assets as linked data;</li>
<li> Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;</li>
<li> Web Services - access all the above over standard protocols, dynamic web pages, web hosting.</li>
</ul>

<p>For each point, one simple statement.  We all know what the above things mean?</p>

<p>Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means that much more data or, in some cases, that many more requests per second. This too is clear.</p>

<p>As far as I am concerned, hosting Java or .<a href="http://dbpedia.org/resource/.NET_Framework" id="link-id0x1f297540">NET</a> does not have to be on the front page. Also, we have no great interest in going against <a href="http://dbpedia.org/resource/Apache" id="link-id0x1ea29578">Apache</a> in a web-server-only situation. The fact that we have a web listener is important for some things, but our claim to fame does not rest on it.</p>

<p>Then for documentation and training materials: The documentation should be better. Specifically, it should have more of a how-to dimension, since nobody reads the whole thing anyhow. As for the online tutorials, the order of presentation should be different, and they do not really reflect what is important at the present moment either.</p>

<p>Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.</p>

<p>I think that the communications failure in this case is that we want to talk about end-to-end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about the query cost model alone: the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to the cost model, as these are important to the whole, but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.</p>

<h3>&quot;Virtuoso is very big and very difficult&quot;</h3>

<p>One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.</p>

<p>This gives you SQL and SPARQL out of the box.  Adding <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x20a2e7d0">ODBC</a> and <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1e4cceb8">JDBC</a> clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.</p>

<p>Now for the difficult side:</p>

<p>Use SQL and SPARQL; use stored procedures whenever there is server-side business logic. For some time-critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: <a href="http://dbpedia.org/resource/PHP" id="link-id0x20b03f08">PHP</a>, Java, or anything else. For web services, simple is best. Stick to basics. &quot;The engineer is one who can invent a simple thing.&quot; Use SQL statements rather than the admin UI.</p>

<p>Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers is cluttered.</p>

<p>We should put this into a couple of use case oriented how-tos.</p>

<p>Also, we should create a network of &quot;friendly local virtuoso geeks&quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.</p>

<h3>&quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&quot;</h3>

<p>Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.</p>

<p>If one really wants to do one&#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.</p>

<p>We are talking about such things with different parties at present.</p>

<h3>&quot;How webby are you?  What is webby?&quot;</h3>

<p>&quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&quot;</p>

<p>We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant <i>when</i> it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.</p>

<p>Google-style crawling of everything becomes less practical if one needs to run complex <i>ad hoc</i> queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.</p>

<h3>&quot;How does OpenLink see the next five years unfolding?&quot;</h3>

<p>Personally, I think we have the basics for the birth of a new inflection in the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x2018bd98">knowledge</a> economy. The <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x1ec110d8">URI</a> is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more <a href="http://dbpedia.org/resource/Information" id="link-id0x1ecfba08">information</a> can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.</p>

<p>The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is <i>context</i>. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.</p>

<p>Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.</p>

<p>Ever since the .com era, the <a href="http://dbpedia.org/resource/Uniform_Resource_Locator" id="link-id0x1c4c9248">URL</a> has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.</p>

<p>With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.</p>
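<p>As a rough illustration of what a &quot;canned SPARQL query&quot; product could amount to, here is a minimal Python sketch. It assumes a hypothetical end point at http://example.org/sparql, and the names <code>CANNED_QUERIES</code> and <code>canned_query_url</code> as well as the query text are made up for the example. The only thing it relies on is that a SPARQL end point accepts the query text in the <code>query</code> parameter of an HTTP GET request.</p>
<pre>
```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Canned queries: fixed query text published under a stable name.
# The query below is illustrative only; it counts instances per class.
CANNED_QUERIES = {
    "top-classes": (
        "SELECT ?type (COUNT(?s) AS ?n) "
        "WHERE { ?s a ?type } "
        "GROUP BY ?type ORDER BY DESC(?n) LIMIT 10"
    ),
}


def canned_query_url(endpoint: str, name: str) -> str:
    """Build the GET URL that runs the named canned query on an end point."""
    return endpoint + "?" + urlencode({"query": CANNED_QUERIES[name]})


# Hypothetical end point; any SPARQL Protocol end point would do.
url = canned_query_url("http://example.org/sparql", "top-classes")

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == CANNED_QUERIES["top-classes"])
```
</pre>
<p>The resulting URL is itself a linkable, bookmarkable thing, which is what makes such a query publishable alongside a blog post.</p>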

<p>For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x20791bf0">Linked Open Data</a> cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.</p>

<p>This makes some things possible that were hard thus far.</p>

<p>On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.</p>

<p>Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1f27e6d8">Sponger</a>, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.</p>

<p>Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.</p>

<p>Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results once things quiet down after the ISWC event. Then there is ongoing maintenance, but here we have shown a steady turnaround and a quick time to fix for pretty much anything.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1466">
  <rss:title>State of the Semantic Web, Part 2 - The Technical Questions (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:02:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Here I will talk about some more technical questions that came up. This is mostly general; Virtuoso specific questions and answers are separate. &quot;How to Bootstrap? Where will the triples come from?&quot; There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern. The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources. &quot;How will one develop applications?&quot; I&#39;d say one will develop them much the same way as thus far. In PHP, for example. Whether one&#39;s query language is SPARQL or SQL does not make a large difference in how basic web UI is made. A SPARQL end-point is no more an end-user item than a SQL command-line is. A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables. The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. 
Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic. What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good. The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall. Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important. For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization. 
For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces. The OpenLink Data Explorer idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this. &quot;What of the proliferation of standards? Does this not look too tangled, no clear identity? How would one know where to begin?&quot; When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure. Just the right amount of structure for the application, less rigid and easier to change than relational. Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure: HTTP, URIs, and XML where desired. It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired. &quot;Can I make a social network application in RDF only, with no RDBMS?&quot; Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students and anyway they do not hang online browsing the course catalogue. 
On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS. If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently. In the near term we will see the birth of an application paradigm for the data web. The data will be an open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Here I will talk about some more technical questions that came up.  This is mostly general; <a href="http://virtuoso.openlinksw.com" id="link-id0x205901a0">Virtuoso</a> specific questions and answers are separate.
</p>

<h3>&quot;How to Bootstrap?  Where will the triples come from?&quot;</h3>

<p>There are already wrappers producing <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x13519ac8">RDF</a> from many applications. Since any structured or semi-structured <a href="http://dbpedia.org/resource/Data" id="link-id0x1c93b418">data</a> can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data <i>per se</i> is not the concern.</p>

<p>The triples may come from any application or database, but they will not come from the end user directly.  There was a good talk about photograph annotation in <a href="http://dbpedia.org/resource/Vienna" id="link-id0x1ea9d150">Vienna</a>, describing many ways of deriving metadata for photos.  The essential wisdom is annotating on the spot and wherever possible doing so automatically.  The consumer is very unlikely to go annotate  photos after the fact.  Further, one can infer that photos made with the same camera around the same time are from the same location.  There are other such heuristics.  In this use case, the end user does not need to see triples.  There is some benefit though in using commonly used geographical terminology for linking to other data sources.</p>
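<p>The same-camera, same-time heuristic can be sketched in a few lines. This is only an illustration under stated assumptions, not the method presented in the talk: the <code>Photo</code> record, the <code>camera_id</code> field, the function name <code>infer_locations</code>, and the 30 minute window are all hypothetical choices made for the example.</p>
<pre>
```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Photo:
    camera_id: str                   # hypothetical: e.g. a camera serial number
    taken_at: datetime
    location: Optional[str] = None   # None until annotated or inferred


def infer_locations(photos: List[Photo], window: timedelta) -> List[Photo]:
    """Copy a known location to unannotated photos taken with the same
    camera within `window` of an annotated photo."""
    # Only photos annotated up front count as evidence, so inferred
    # locations are never chained from one guess to the next.
    annotated = [p for p in photos if p.location is not None]
    for photo in photos:
        if photo.location is not None:
            continue
        # Annotated photos from the same camera, nearest in time first.
        candidates = sorted(
            (a for a in annotated if a.camera_id == photo.camera_id),
            key=lambda a: abs(a.taken_at - photo.taken_at),
        )
        if candidates and window >= abs(candidates[0].taken_at - photo.taken_at):
            photo.location = candidates[0].location
    return photos


shots = [
    Photo("cam-1", datetime(2008, 10, 1, 12, 0), "Vienna"),
    Photo("cam-1", datetime(2008, 10, 1, 12, 5)),   # same camera, 5 minutes later
    Photo("cam-1", datetime(2008, 10, 3, 9, 0)),    # same camera, two days later
    Photo("cam-2", datetime(2008, 10, 1, 12, 1)),   # different camera
]
infer_locations(shots, timedelta(minutes=30))
print([p.location for p in shots])  # ['Vienna', 'Vienna', None, None]
```
</pre>
<p>Only the second photo inherits the annotation; the end user never sees a triple, yet the inferred location can later be published using a shared geographical vocabulary.</p>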

<h3>&quot;How will one develop applications?&quot;</h3>

<p>I&#39;d say one will develop them much the same way as thus far.  In <a href="http://dbpedia.org/resource/PHP" id="link-id0x207fca00">PHP</a>, for example.  Whether one&#39;s query language is <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x20a5fde0">SPARQL</a> or <a href="http://dbpedia.org/resource/SQL" id="link-id0x1a0bb5e0">SQL</a> does not make a large difference in how basic web UI is made.</p>

<p>A SPARQL end-point is no more an end-user item than a SQL command-line is.</p>

<p>A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure.  The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.</p>

<p>The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly.  Even I, at OpenLink, after supposedly having been around the block a couple of times, made some experiments around the topic.  What does make sense is putting a thin wrapper of HTML around the application, using XSLT and such for formatting.  Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms.  For this, Ajax technologies like <a href="http://oat.openlinksw.com/" id="link-id0x1e91d118">OAT</a> (the <a href="http://oat.openlinksw.com/" id="link-id0x174b7950">OpenLink AJAX Toolkit</a>) will be good.</p>

<p>The UI ought not to completely hide the URIs of the data from the user.  It should offer a drill down to faceted views of the triples for example.  Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall.</p>

<p>Since then, we have vacillated between modal and non-modal interaction models.  Repetitive workflows like order entry go best modally and are anyway being replaced by web services.  Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example.  But enabling the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1ea14610">knowledge</a> worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs.  Face it: the world is not getting simpler.  It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.</p>

<p>For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested.  For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x207bcd20">linked data</a>.  It makes for more flexible querying.  Linked data does not <i>per se</i> mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.</p>

<p>For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority.  Thus the user interfaces in places where RDF will have real edge must be more capable of <i>ad hoc</i> viewing and navigation than regular real-time or line of business user interfaces.</p>

<p>The <a href="http://ode.openlinksw.com/" id="link-id0x2083a6f0">OpenLink Data Explorer</a> idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view.  There are and will be separate clarifications and demos about this.</p>

<h3>&quot;What of the proliferation of standards?  Does this not look too tangled, no clear identity?  How would one know where to begin?&quot;</h3>

<p>When <a href="http://www.w3.org/2001/sw/sweo/" id="link-id0x1e8eac68">SWEO</a> was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure.  Just the right amount of structure for the application, less rigid and easier to change than relational.</p>

<p>Do not even mention the different serialization formats.  Just say that it fits on top of the accepted web infrastructure: <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1e3806b8">HTTP</a>, URIs, and <a href="http://dbpedia.org/resource/XML" id="link-id0x1f547288">XML</a> where desired.</p>

<p>It is misleading to say inference is a box at some specific place in the diagram.  Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step.  Since there is structure and semantics, inference is possible if desired.</p>

<h3>&quot;Can I make a social network application in RDF only, with no <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x20553ee0">RDBMS</a>?&quot;</h3>

<p>Yes, in principle, but what do you have in mind?  The answer is very context dependent.  The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc.  In such a case, RDF is a great match, especially since the user count will not be in the millions.  No university has that many students and anyway they do not hang online browsing the course catalogue.</p>

<p>On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS.</p>
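<p>To make the login counter example concrete, here is a toy comparison. It is a sketch under loud assumptions, not a benchmark and not a description of how Virtuoso or any other store works: the triple store is assumed to keep every triple in three permuted indexes, the row store is assumed to update a counter in place, and all class and attribute names are invented for the illustration.</p>
<pre>
```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, int]


class ToyTripleStore:
    """Keeps every triple in three permuted indexes (an assumed layout)."""

    def __init__(self) -> None:
        self.indexes: Dict[str, Set[tuple]] = {"spo": set(), "pos": set(), "osp": set()}
        self.index_writes = 0

    @staticmethod
    def _permute(order: str, triple: Triple) -> tuple:
        parts = dict(zip("spo", triple))
        return tuple(parts[k] for k in order)

    def add(self, triple: Triple) -> None:
        for order, index in self.indexes.items():
            index.add(self._permute(order, triple))
            self.index_writes += 1

    def remove(self, triple: Triple) -> None:
        for order, index in self.indexes.items():
            index.discard(self._permute(order, triple))
            self.index_writes += 1

    def set_value(self, subject: str, predicate: str, old: int, new: int) -> None:
        # Changing a value means deleting one triple and inserting another.
        self.remove((subject, predicate, old))
        self.add((subject, predicate, new))


class ToyRowStore:
    """One row per user; a counter column is updated in place."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, int]] = {}
        self.row_writes = 0

    def increment(self, key: str, column: str) -> None:
        row = self.rows.setdefault(key, {})
        row[column] = row.get(column, 0) + 1
        self.row_writes += 1


triples = ToyTripleStore()
rows = ToyRowStore()
triples.add(("user:1", "loginCount", 0))
for n in range(10):  # ten logins recorded in each store
    triples.set_value("user:1", "loginCount", n, n + 1)
    rows.increment("user:1", "loginCount")

print(triples.index_writes, rows.row_writes)  # 63 10
```
</pre>
<p>Under these assumptions each login costs six index writes in the triple layout against one row write, which is the kind of constant factor meant by &quot;several times less efficient&quot;; real systems will differ in the details.</p>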

<p>If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical.  This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style.  This is a matter of cost more than of feasibility.  Ten servers cost less than a hundred and have failures ten times less frequently.</p>

<p>In the near term we will see the birth of an application paradigm for the data web. The data will be an open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1464">
  <rss:title>State of the Semantic Web, Part 2 - The Technical Questions (updated)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-26T12:02:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Here I will talk about some more technical questions that came up. This is mostly general; Virtuoso specific questions and answers are separate. &quot;How to Bootstrap? Where will the triples come from?&quot; There are already wrappers producing RDF from many applications. Since any structured or semi-structured data can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data per se is not the concern. The triples may come from any application or database, but they will not come from the end user directly. There was a good talk about photograph annotation in Vienna, describing many ways of deriving metadata for photos. The essential wisdom is annotating on the spot and wherever possible doing so automatically. The consumer is very unlikely to go annotate photos after the fact. Further, one can infer that photos made with the same camera around the same time are from the same location. There are other such heuristics. In this use case, the end user does not need to see triples. There is some benefit though in using commonly used geographical terminology for linking to other data sources. &quot;How will one develop applications?&quot; I&#39;d say one will develop them much the same way as thus far. In PHP, for example. Whether one&#39;s query language is SPARQL or SQL does not make a large difference in how basic web UI is made. A SPARQL end-point is no more an end-user item than a SQL command-line is. A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure. The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables. The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly. 
Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic. What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting. Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms. For this, Ajax technologies like OAT (the OpenLink AJAX Toolkit) will be good. The UI ought not to completely hide the URIs of the data from the user. It should offer a drill down to faceted views of the triples for example. Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall. Since then, we have vacillated between modal and non-modal interaction models. Repetitive workflows like order entry go best modally and are anyway being replaced by web services. Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example. But enabling the knowledge worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs. Face it: the world is not getting simpler. It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important. For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested. For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as linked data. It makes for more flexible querying. Linked data does not per se mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization. 
For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority. Thus the user interfaces in places where RDF will have real edge must be more capable of ad hoc viewing and navigation than regular real-time or line of business user interfaces. The OpenLink Data Explorer idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view. There are and will be separate clarifications and demos about this. &quot;What of the proliferation of standards? Does this not look too tangled, no clear identity? How would one know where to begin?&quot; When SWEO was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure. Just the right amount of structure for the application, less rigid and easier to change than relational. Do not even mention the different serialization formats. Just say that it fits on top of the accepted web infrastructure: HTTP, URIs, and XML where desired. It is misleading to say inference is a box at some specific place in the diagram. Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step. Since there is structure and semantics, inference is possible if desired. &quot;Can I make a social network application in RDF only, with no RDBMS?&quot; Yes, in principle, but what do you have in mind? The answer is very context dependent. The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc. In such a case, RDF is a great match, especially since the user count will not be in the millions. No university has that many students and anyway they do not hang online browsing the course catalogue. 
On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS. If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical. This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style. This is a matter of cost more than of feasibility. Ten servers cost less than a hundred and have failures ten times less frequently. In the near term we will see the birth of an application paradigm for the data web. The data will be an open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Here I will talk about some more technical questions that came up.  This is mostly general; <a href="http://virtuoso.openlinksw.com" id="link-id0x1f53d1a0">Virtuoso</a> specific questions and answers are separate.
</p>

<h3>&quot;How to Bootstrap?  Where will the triples come from?&quot;</h3>

<p>There are already wrappers producing <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1beda278">RDF</a> from many applications. Since any structured or semi-structured <a href="http://dbpedia.org/resource/Data" id="link-id0x1e57c648">data</a> can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data <i>per se</i> is not the concern.</p>

<p>The triples may come from any application or database, but they will not come from the end user directly.  There was a good talk about photograph annotation in <a href="http://dbpedia.org/resource/Vienna" id="link-id0x2028b7e8">Vienna</a>, describing many ways of deriving metadata for photos.  The essential wisdom is annotating on the spot and wherever possible doing so automatically.  The consumer is very unlikely to go annotate  photos after the fact.  Further, one can infer that photos made with the same camera around the same time are from the same location.  There are other such heuristics.  In this use case, the end user does not need to see triples.  There is some benefit though in using commonly used geographical terminology for linking to other data sources.</p>

<h3>&quot;How will one develop applications?&quot;</h3>

<p>I&#39;d say one will develop them much the same way as thus far.  In <a href="http://dbpedia.org/resource/PHP" id="link-id0x1eff1748">PHP</a>, for example.  Whether one&#39;s query language is <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1d83dff8">SPARQL</a> or <a href="http://dbpedia.org/resource/SQL" id="link-id0x1e9f4e88">SQL</a> does not make a large difference in how basic web UI is made.</p>

<p>A SPARQL end-point is no more an end-user item than a SQL command-line is.</p>
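
<p>As a rough illustration of why the query language matters little to the UI layer, here is a minimal Python sketch. It assumes the SPARQL result arrives in the standard SPARQL JSON results layout (<code>head.vars</code> and <code>results.bindings</code>); the function names <code>rows_from_sql</code>, <code>rows_from_sparql_json</code>, and <code>render_table</code> are hypothetical and chosen only for this example. Both kinds of result are flattened into plain dictionaries, and one renderer serves either.</p>

```python
import html
import sqlite3


def rows_from_sql(connection, query):
    """Run a SQL query and return its rows as a list of plain dicts."""
    cursor = connection.execute(query)
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, record)) for record in cursor.fetchall()]


def rows_from_sparql_json(result):
    """Flatten a SPARQL JSON result document into a list of plain dicts.

    Unbound variables become None; datatypes and language tags are dropped,
    which is a simplification made for this sketch.
    """
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        row = {}
        for name in variables:
            term = binding.get(name)
            row[name] = term["value"] if term is not None else None
        rows.append(row)
    return rows


def render_table(rows):
    """Render rows as an HTML table; it neither knows nor cares where they came from."""
    if not rows:
        return "<table></table>"
    columns = list(rows[0])
    head = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(row[c]))}</td>" for c in columns)
        + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"
```

<p>The page template only ever sees the list of dictionaries, so swapping a SQL back-end for a SPARQL end-point changes the data-access function and nothing else.</p>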

<p>A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure.  The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.</p>

<p>The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly.  Even I, at OpenLink, after supposedly having been around the block a couple of times, made some experiments around the topic.  What does make sense is putting a thin wrapper of HTML around the application, using XSLT and such for formatting.  Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms.  For this, Ajax technologies like <a href="http://oat.openlinksw.com/" id="link-id0x1d780520">OAT</a> (the <a href="http://oat.openlinksw.com/" id="link-id0x20943788">OpenLink AJAX Toolkit</a>) will be good.</p>

<p>The UI ought not to completely hide the URIs of the data from the user.  It should offer a drill down to faceted views of the triples for example.  Remember when Xerox talked about graphical user interfaces in 1980? &quot;Don&#39;t mode me in&quot; was the slogan, as I recall.</p>

<p>Since then, we have vacillated between modal and non-modal interaction models.  Repetitive workflows like order entry go best modally and are anyway being replaced by web services.  Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example.  But enabling the <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1e14eb88">knowledge</a> worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs.  Face it: the world is not getting simpler.  It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.</p>

<p>For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested.  For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1e91d118">linked data</a>.  It makes for more flexible querying.  Linked data does not <i>per se</i> mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.</p>

<p>For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority.  Thus the user interfaces in places where RDF will have a real edge must be more capable of <i>ad hoc</i> viewing and navigation than regular real-time or line of business user interfaces.</p>

<p>The <a href="http://ode.openlinksw.com/" id="link-id0x1c7f8ee0">OpenLink Data Explorer</a> idea of a &quot;data behind the web page&quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view.  There are and will be separate clarifications and demos about this.</p>

<h3>&quot;What of the proliferation of standards?  Does this not look too tangled, no clear identity?  How would one know where to begin?&quot;</h3>

<p>When <a href="http://www.w3.org/2001/sw/sweo/" id="link-id0x1d73c268">SWEO</a> was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure.  Just the right amount of structure for the application, less rigid and easier to change than relational.</p>

<p>Do not even mention the different serialization formats.  Just say that it fits on top of the accepted web infrastructure: <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1efefed0">HTTP</a>, URIs, and <a href="http://dbpedia.org/resource/XML" id="link-id0x1af89b18">XML</a> where desired.</p>

<p>It is misleading to say inference is a box at some specific place in the diagram.  Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step.  Since there is structure and semantics, inference is possible if desired.</p>

<h3>&quot;Can I make a social network application in RDF only, with no <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1cb62cd8">RDBMS</a>?&quot;</h3>

<p>Yes, in principle, but what do you have in mind?  The answer is very context dependent.  The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc.  In such a case, RDF is a great match, especially since the user count will not be in the millions.  No university has that many students and anyway they do not hang online browsing the course catalogue.</p>

<p>On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS.</p>
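
<p>To make the login-counter point concrete, here is a toy Python sketch. It assumes a deliberately naive in-memory triple store and a dictionary standing in for a relational row; <code>TripleStore</code>, <code>record_login_triples</code>, and <code>record_login_row</code> are hypothetical names, and the operation counts are illustrative bookkeeping, not measurements of any real system.</p>

```python
class TripleStore:
    """Naive triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()
        self.operations = 0  # primitive index operations, counted for illustration

    def _find(self, subject, predicate):
        self.operations += 1  # one index lookup
        return [t for t in self.triples if t[0] == subject and t[1] == predicate]

    def value(self, subject, predicate, default=None):
        found = self._find(subject, predicate)
        return found[0][2] if found else default

    def replace(self, subject, predicate, new_object):
        # Updating a single-valued property means deleting the old triple
        # and inserting a new one.
        for old in self._find(subject, predicate):
            self.triples.remove(old)
            self.operations += 1
        self.triples.add((subject, predicate, new_object))
        self.operations += 1


def record_login_triples(store, user, when):
    """Bump the login counter and last-login time, one property at a time."""
    count = store.value(user, "loginCount", 0)
    store.replace(user, "loginCount", count + 1)
    store.replace(user, "lastLogin", when)


def record_login_row(rows, user, when):
    """Same update against a row: one lookup, then both columns change in place."""
    row = rows.setdefault(user, {"login_count": 0, "last_login": None})
    row["login_count"] += 1
    row["last_login"] = when
    return 1  # one row access per login in this toy model
```

<p>In this model each login costs one row access on the relational side, against several lookups, deletes, and inserts on the triple side. Real stores differ in the details, but this is the shape of the overhead described above.</p>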

<p>If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical.  This does not mean that the application language cannot be SPARQL, but it does mean that the storage format must be tuned to favor some operations over others, relational style.  This is a matter of cost more than of feasibility.  Ten servers cost less than a hundred and have failures ten times less frequently.</p>

<p>In the near term we will see the birth of an application paradigm for the data web. The data will be an open, exposed, first-class citizen; yet the user experience will not have to be a 1:1 image of the data.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-10-24#1461">
  <rss:title>Virtuoso, PHP 3.5 Runtime Hosting, phpBB3, and Linked Data</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-24T19:55:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Runtime hosting is a functionality realm of Virtuoso that is sometimes easily overlooked. In this post I want to provide a simple no-hassles HOWTO guide for installing Virtuoso on Windows (32 or 64 Bit), Mac OS X (Universal or Native 64 Bit), and Linux (32 or 64 Bit). The installation guide also covers the instantiation of phpBB3 as verification of the Virtuoso hosted PHP 3.5 runtime. What are the benefits of PHP Runtime Hosting? Simple, this means that like Apache, Virtuoso is a bona-fide Web Application Server for a PHP application. Unlike Apache, Virtuoso is also the following: a DBMS Engine (SQL, XML, RDF, and unstructured Text) that is accessible via industry standard interfaces (solely) a Virtual DBMS or Master Data Manager (MDM) for heterogeneous and distributed SQL, XML, RDF, unstructured Text based data sources an RDF Middleware solution for RDF-zation of non RDF resources across the Web and enterprise Intranets and/or Extranets (in the form of Cartridges for SOA &amp; REST Servers and RDF Views (Semantic Covers) over SQL and/or XML data sources) an RDF Linked Data Server (meaning it can deploy RDF Linked Data) As a result of the above, when you deploy a PHP application using Virtuoso, you inherit the following benefits: Use of PHP-iODBC for in-process communication with Virtuoso Easy generation of RDF Linked Data from the SQL schemas of PHP applications Easy deployment of RDF Linked Data Less LAMP monoculture (*there is no such thing as virtuous monoculture*) when dealing with PHP based Web applications. As indicated in prior posts, producing RDF Linked Data from the existing Web, where a lot of content is deployed by PHP based content managers, should simply come down to RDF Views over the SQL Schemas and deployment / publishing of the RDF Views in RDF Linked data form. 
In a nutshell, this is what Virtuoso delivers via its PHP runtime hosting and pre packaged VADs (Virtuoso Application Distribution packages) for popular PHP based applications such as: phpBB3, Drupal, WordPress, and MediaWiki. In addition to the RDF Linked Data deployment, we&#39;ve also taken the traditional LAMP installation tedium out of the typical PHP application deployment process. For instance, you don&#39;t have to rebuild PHP 3.5 (32 or 64 Bit) on Windows, Mac OS X, or Linux to get going; simply install Virtuoso, and then select a VAD package for the relevant application and you&#39;re set. If the application of choice isn&#39;t pre packaged by us, simply install as you would when using Apache, which comes down to situating the PHP files in your Web structure under the Web Application&#39;s root directory. Installation Guide Download the Virtuoso installer for Windows (32 Bit msi file or 64 Bit msi file), Mac OS X (Universal Binary dmg file), or instantiate the Virtuoso EC2 AMI; search for pattern: &quot;OpenLink&quot;, when using the Firefox extension or Web Interface based EC2 management consoles or look for: AMI ID: ami-c46084ad and Manifest Name: openlink/virtuoso-uim-unisvr-psnl/5.0/i686-fedora-linux-9.manifest.xml (32 bit edition) AMI ID: ami-59628630 and Manifest Name: openlink/virtuoso-uim-unisvr-psnl/5.0/i686-fedora-linux-9.manifest.xml (64 bit edition) Run the installer (or download the movies using the links in the related section below) Go to the Virtuoso Conductor (*which will show up at the end of the installation process* or go to http://localhost:8890/conductor) Go to the &quot;Admin&quot; tab within the (X)HTML based UI and select the &quot;Packages&quot; sub-menu item (a Tab) Pick phpBB3 (or any other pre-packaged PHP app) and then click on &quot;Install/Upgrade&quot; Then watch one of my silent movies or read the initial startup guides for Virtuoso hosted phpBB3, Drupal, Wordpress, MediaWiki. 
Related At the current time, I&#39;ve only provided links to ZIP files containing the Virtuoso installation &quot;silent movies&quot;. This approach is a short-term solution to some of my current movie publishing challenges re. YouTube and Vimeo -- where the compressed output hasn&#39;t been of acceptable visual quality. Once resolved, I will publish much more &quot;Multimedia Web&quot; friendly movies :-) Windows Vista (x64) Installation Movie Mac OS X (x64 &amp; Universal binary) Installation Movie Virtuoso EC2 Cloud Edition Installation Movie</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
 <p>Runtime hosting is a functionality realm of <a href="http://virtuoso.openlinksw.com" id="link-id1189fee8">Virtuoso</a> that is sometimes easily overlooked. In this post I want to provide a simple no-hassles HOWTO guide for installing Virtuoso on Windows (32 or 64 Bit), Mac OS X (Universal or Native 64 Bit), and Linux (32 or 64 Bit). The installation guide also covers the instantiation of <a href="http://dbpedia.org/resource/PhpBB" id="link-id118af3a8">phpBB3</a> as verification of the Virtuoso hosted <a href="http://dbpedia.org/resource/PHP" id="link-id12736b88">PHP</a> 3.5 runtime.</p>  <h3>What are the benefits of PHP Runtime Hosting?</h3>  <p>Simple, this means that like <a href="http://dbpedia.org/resource/Apache" id="link-id111ca408">Apache</a>, Virtuoso is a bona-fide <a href="http://dbpedia.org/resource/World_Wide_Web" id="link-id0xba014968">Web</a> <a href="http://dbpedia.org/resource/Application_server" id="link-id110d2aa8">Application Server</a> for a PHP application. Unlike Apache, Virtuoso is also the following:</p>  <ul> <li>a DBMS Engine (<a href="http://dbpedia.org/resource/SQL" id="link-id10f43d78">SQL</a>, XML, RDF, and unstructured Text) that is accessible via industry standard interfaces (solely)</li> <li>a Virtual DBMS or Master <a href="http://dbpedia.org/resource/Data" id="link-id0x141b0a20">Data</a> Manager (MDM) for heterogeneous and distributed SQL, XML, RDF, unstructured Text based data sources</li> <li>an <a href="http://www.openlinksw.com/weblog/public/search.vspx?blogid=127&amp;q=rdf%20middleware&amp;type=text&amp;output=html" id="link-id1116aad8">RDF Middleware</a> solution for RDF-zation of non RDF resources across the Web and enterprise Intranets and/or Extranets (in the form of Cartridges for SOA &amp; REST Servers and RDF Views (Semantic Covers) over SQL and/or XML data sources)</li> <li>an RDF <a href="http://dbpedia.org/resource/Linked_Data" id="link-id10fbe088">Linked Data</a> Server (meaning it can deploy RDF Linked 
Data)</li> </ul>  <p>As a result of the above, when you deploy a PHP application using Virtuoso, you inherit the following benefits:</p> <ol> <li>Use of PHP-<a href="http://www.iodbc.org" id="link-id1159e070">iODBC</a> for in-process communication with Virtuoso</li> <li>Easy generation of RDF Linked Data from the SQL schemas of PHP applications</li> <li>Easy deployment of RDF Linked Data</li> <li>Less <a href="http://dbpedia.org/resource/LAMP_stack" id="link-id1179dff0">LAMP</a> monoculture (*there is no such thing as virtuous monoculture*) when dealing with PHP based Web applications. </li> </ol>  <p>As indicated in prior posts, producing RDF Linked Data from the existing Web, where a lot of content is deployed by PHP based content managers, should simply come down to RDF Views over the SQL Schemas and deployment / publishing of the RDF Views in RDF Linked data form. In a nutshell,  this is what Virtuoso delivers via its PHP runtime hosting and pre packaged VADs (Virtuoso Application Distribution packages) for popular PHP based applications such as: phpBB3, <a href="http://dbpedia.org/resource/Drupal" id="link-id111ff1c0">Drupal</a>, <a href="http://dbpedia.org/resource/WordPress" id="link-id111e26f8">WordPress</a>, and <a href="http://dbpedia.org/resource/MediaWiki" id="link-id10ea0258">MediaWiki</a>.</p>  <p>In addition to the RDF Linked Data deployment, we&#39;ve also taken the traditional LAMP installation tedium out of the typical PHP application deployment process. For instance, you don&#39;t have to rebuild PHP 3.5 (32 or 64 Bit) on Windows, Mac OS X, or Linux to get going; simply install Virtuoso, and then select a VAD package for the relevant application and you&#39;re set. 
If the application of choice isn&#39;t pre packaged by us, simply install as you would when using Apache, which comes down to situating the PHP files in your Web structure under the Web Application&#39;s root directory.</p>  <h3>Installation Guide</h3> <ol> <li>Download the Virtuoso installer for Windows (<a href="http://virtuoso-installers.s3.amazonaws.com/virt50_server_Windows_x86_32-20081022.msi" id="link-id1160cc80">32 Bit msi file</a> or <a href="http://virtuoso-installers.s3.amazonaws.com/virt50_server_Windows_x86_64-20081022.msi" id="link-id11239828">64 Bit msi file</a>), Mac OS X (<a href="http://virtuoso-installers.s3.amazonaws.com/Virtuoso-PersonalEdition-V5.0-MacOSX-10.5-Universal.dmg" id="link-id110511f8">Universal Binary dmg file</a>), or instantiate the <a href="http://www.openlinksw.com/oat/wiki/main/Main/ODSInstallationEC2" id="link-id111fe248">Virtuoso EC2 AMI</a>; search for pattern: &quot;OpenLink&quot;, when using the Firefox extension or Web Interface based EC2 management consoles or look for:</li>  <ul>   <li>AMI ID: ami-c46084ad and Manifest Name: openlink/virtuoso-uim-unisvr-psnl/5.0/i686-fedora-linux-9.manifest.xml (32 bit edition)   </li>   <li>AMI ID: ami-59628630 and Manifest Name: openlink/virtuoso-uim-unisvr-psnl/5.0/i686-fedora-linux-9.manifest.xml (64 bit edition)<br />   </li>  </ul> <li>Run the installer (or download the movies using the links in the related section below)</li> <li>Go to the Virtuoso Conductor (*which will show up at the end of the installation process* or go to http://localhost:8890/conductor)</li> <li>Go to the &quot;Admin&quot; tab within the (X)HTML based UI and select the &quot;Packages&quot; sub-menu item (a Tab)</li> <li>Pick phpBB3 (or any other pre-packaged PHP app) and then click on &quot;Install/Upgrade&quot;</li> <li>Then watch one of my silent movies or read the initial startup guides for Virtuoso hosted phpBB3, Drupal, Wordpress, MediaWiki.</li> </ol> <h3>Related</h3> <p> At the current time, I&#39;ve 
only provided links to ZIP files containing the Virtuoso installation &quot;silent movies&quot;. This approach is a short-term solution to some of my current movie publishing challenges re. YouTube and Vimeo -- where the compressed output hasn&#39;t been of acceptable visual quality. Once resolved, I will publish much more &quot;Multimedia Web&quot; friendly movies :-)</p> <ul> <li>   <a href="http://my-movies.s3.amazonaws.com/Virtuoso_PHPBB3_Vista_Linked_Data_Demo.mov.zip" id="link-id11642450">Windows Vista (x64) Installation Movie</a> </li> <li>   <a href="http://my-movies.s3.amazonaws.com/Virtuoso_PHPBB3_MacOSX_Linked_Data_Demo.mov.zip" id="link-id11210498">Mac OS X (x64 &amp; Universal binary) Installation Movie</a> </li> <li>   <a href="http://my-movies.s3.amazonaws.com/Virtuoso_PHPBB3_EC2_AMI_Linked_Data_Demo.zip" id="link-id111ff268">Virtuoso EC2 Cloud Edition Installation Movie</a> </li> </ul>  
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-10-10#1456">
  <rss:title>The Calamitous Nature of Opportunity</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-10T16:30:53Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">As articulated in timeless fashion by Albert Einstein: The significant problems we face cannot be solved at the same level of thinking we were at when we created them. This quote also applies to the current global financial mess because the essence of this crisis remains inextricably linked to dependency on outdated &quot;closed world&quot; systems. How we got here (5,000 ft. view) We have a global human network that depends on systems driven by, and confined to, data silos! Every time you hear a CEO, Government Official, work colleague, neighbor, sibling, or relative tell you they didn&#39;t see it coming, just remember: For every action, there is an equal and opposite reaction For every debit there is a credit What goes around, comes around No man is an Island (little tweak: Human) We are all Linked whether we like it or not System preserving reboots are a feature of all intelligently designed systems. Why there won&#39;t be a Depression There won&#39;t be a depression because we can&#39;t afford one. Just like we couldn&#39;t afford to continue with the manner in which our systems work today. Unlike the &#39;30s, we all know that there are no absolute safe havens right now; we have enough information at our disposal to eventually understand (post panic) that stuffing the mattress isn&#39;t an option (even government bonds won&#39;t cut it, ditto money market accounts). The Opportunity Take a deep breath and tell traditional media to &quot;shut up&quot;. As per usual, the traditional mass media wants to have it both ways by stoking the panic and maxing out on the frenzy with reckless abandon. If there is a time to appreciate the blogosphere and quality journalism, it&#39;s now. Anyway, as the saying goes: &quot;It&#39;s always darkest before dawn&quot;, and as bizarre as this may sound in some quarters, things will ultimately change for the better. 
It just so happened that a really big cane was required in order for us to change our dysfunctional ways :-( I recently wrote a post about &quot;zero based cognition&quot; that sought to bring attention to the power of &quot;Human Thought&quot; in relation to value creation. Innovative creation and dissemination of value is how we will eventually get out of the current mess (as we&#39;ve done in the past). The predictability of the aforementioned reality is significantly increased by the sheer link density and resulting &quot;network effects&quot; potential of the Internet and World Wide Web. Our ability to &quot;connect the dots&quot; as part of our value creation, dissemination, and consumption processing pipelines is what will ultimately separate the winners from the losers (individuals, enterprises, nations). Related Yihong Ding&#39;s insightful perspectives Jason Kolb&#39;s poignant piece titled: The Year Innovation Died Tech Start-ups and the Economy&#39;s Best Hope Money as Debt - (a documentary spotted by Danja) Peter Kafka&#39;s post: Smart Startup Advice: Don&#39;t Panic - Profit George Soros Interview Mark Cuban (Blog Maverick) echoing &quot;Entrepreneurship is the key&quot; sentiment.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>As articulated in timeless fashion by <a href="http://dbpedia.org/resource/Albert_Einstein" id="link-id160b76b8">Albert Einstein</a>:
</p>
<blockquote>
<cite>The significant problems we face cannot be solved at the same level of thinking we were at when we created them. </cite>
</blockquote>
<p>This quote also applies to the current global financial mess because the essence of this crisis remains inextricably linked to dependency on outdated &quot;<a href="http://dbpedia.org/resource/Closed_world_assumption" id="link-id14a6b6c0">closed world</a>&quot; systems.</p> 
<h3>How we got here (5,000 ft. view)</h3>
<p>We have a global human network that depends on systems driven by, and confined to, <a href="http://dbpedia.org/resource/Data">data</a> silos! Every time you hear a CEO, Government Official, work colleague, neighbor, sibling, or relative tell you they didn&#39;t see it coming, just remember:
</p>
<ul>
<li>For every action, there is an equal and opposite reaction</li>
<li>For every debit there is a credit</li>
<li>What goes around, comes around</li>
<li>
  <a href="http://www.quotedb.com/quotes/245" id="link-id12ace758">No man is an Island</a> (little tweak: Human)</li>
<li>We are all Linked whether we like it or not</li>
<li>System preserving reboots are a feature of all intelligently designed systems.</li>
</ul>
<h3>Why there won&#39;t be a Depression</h3>
<p>There won&#39;t be a depression because we can&#39;t afford one. Just like we couldn&#39;t afford to continue with the manner in which our systems work today. Unlike the &#39;30s, we all know that there are no absolute safe havens right now; we have enough <a href="http://dbpedia.org/resource/Information" id="link-id13d0c258">information</a> at our disposal to eventually understand (post panic) that stuffing the mattress isn&#39;t an option (even government bonds won&#39;t cut it, ditto money market accounts).</p>

<h3>The Opportunity</h3>
<p>Take a deep breath and tell traditional media to &quot;shut up&quot;. As per usual, the traditional mass media wants to have it both ways by stoking the panic and maxing out on the frenzy with reckless abandon. If there is a time to appreciate the blogosphere and quality journalism, it&#39;s now.</p>
<p>
Anyway, as the saying goes: &quot;It&#39;s always darkest before dawn&quot;, and as bizarre as this may sound in some quarters, things will ultimately change for the better. It just so happened that a really big cane was required in order for us to change our dysfunctional ways :-(</p>
<p>I recently wrote a post about &quot;<a href="http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/1440" id="link-id115387f8">zero based cognition</a>&quot; that sought to bring attention to the power of &quot;Human Thought&quot; in relation to value creation.</p>
<p>Innovative creation and dissemination of value is how we will eventually get out of the current mess (as we&#39;ve done in the past). The predictability of the aforementioned reality is significantly increased by the sheer link density and resulting &quot;network effects&quot; potential of the <a href="http://dbpedia.org/resource/Internet" id="link-id14a595e8">Internet</a> and <a href="http://dbpedia.org/resource/World_Wide_Web" id="link-id1112a570">World Wide Web</a>. Our ability to &quot;connect the dots&quot; as part of our value creation, dissemination, and consumption processing pipelines is what will ultimately separate the winners from the losers (individuals, enterprises, nations).</p>

<h3>Related</h3>
<ul>
<li>
  <a href="http://yihongs-research.blogspot.com" id="link-id14b0fb90">Yihong Ding</a>&#39;s insightful
  <a href="http://yihongs-research.blogspot.com/2008/10/financial-crisis-who-will-be-winner.html" id="link-id112197b0">perspectives</a>
</li>
<li>
  <a href="http://www.jasonkolb.com/" id="link-id112d4ad8">Jason Kolb</a>&#39;s poignant piece titled: <a href="http://www.jasonkolb.com/weblog/2008/10/the-year-the-innovation-died.html" id="link-id10fe7008">The Year Innovation Died</a>
</li>
<li>
  <a href="http://www.portfolio.com/views/blogs/the-tech-observer/2008/10/09/tech-start-ups-and-the-economys-best-hope?tid=true" id="link-id14a80788">Tech Start-ups and the Economy&#39;s Best Hope</a>
</li>
<li>
  <a href="http://video.google.com/videoplay?docid=-9050474362583451279" id="link-id11053b90">Money as Debt</a> - (a documentary spotted by <a href="http://hyperdata.org/blog/" id="link-id114c0e30">Danja</a>)</li>
<li>
  <a href="http://www.alleyinsider.com/peter_kafka" id="link-id10f01b10">Peter Kafka</a>&#39;s post: <a href="http://www.alleyinsider.com/2008/12/startup-advice-how-to-make-the-collapse-work-for-you" id="link-id10de8058">Smart Startup Advice: Don&#39;t Panic - Profit</a>
</li>
<li>
  <a href="http://www.huffingtonpost.com/nathan-gardels/soros-end-of-financial-cr_b_134008.html" id="link-id10fef1e8">George Soros Interview</a>
</li>
<li>Mark Cuban (<a href="http://blogmaverick.com/" id="link-ide8b5298">Blog Maverick</a>) echoing &quot;<a href="http://blogmaverick.com/2008/10/23/the-cure-to-our-economic-problems/" id="link-id10e630d8">Entrepreneurship is the key&quot;</a> sentiment.</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1450">
  <rss:title>Virtuoso Update, Billion Triples and Outlook</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T10:02:32Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Update, Billion Triples and Outlook I will say a few things about what we have been doing and where we can go. Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post. There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe. Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work. We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. We did along the way also run BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL. RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol. In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web. 
I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post. Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time. Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later. The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility. We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. 
There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case. We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface. We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person. Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible. The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity. Still, one must know what to ask. 
In this respect, the self-describing nature of RDF is unmatched. A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable. Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience. The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports. Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso Update, Billion Triples and Outlook</div>
<p>I will say a few things about what we have been doing and where we can go.</p>

<p>Firstly, we have a fairly scalable platform with <a href="http://virtuoso.openlinksw.com" id="link-id0x1aa82dc0">Virtuoso</a> 6 Cluster. It was most recently tested with the workload discussed in the previous <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id1638a5b8">Billion Triples post</a>.</p>

<p>There is an updated version of <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16280a68">the paper about this</a>. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.</p>

<p>Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more <a href="http://dbpedia.org/resource/SQL" id="link-id0x1abd3f38">SQL</a> optimizations specific to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1adbe410">RDF</a>. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.</p>

<p>We spent a lot of time around the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1aaa0e78">Berlin SPARQL Benchmark</a> story, so we got to the more advanced stuff like the <a href="http://challenge.semanticweb.org/" id="link-id0x1a860a50">Billion Triples Challenge</a> rather late. Along the way, we also ran <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1a27f2a8">BSBM</a> with an <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1ad5c918">Oracle</a> back-end, with Virtuoso mapping <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1cf0e4a0">SPARQL</a> to SQL. This merits its own analysis in the near future, which will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for real-time lookups or for ETL.</p>

<p>RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1ab96bb0">RDB2RDF XG</a>. Examples of complex warehouses include the <a href="http://neurocommons.org/page/Main_Page" id="link-id0x1adb2db0">Neurocommons</a> database, the Billion Triples Challenge, and the <a href="http://www.garlik.com/" id="link-id0x1925c7b0">Garlik DataPatrol</a>.</p>

<p>In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1c6d1480">Linked Data</a> forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1a937400">data</a> web becomes as indispensable as presence on the HTML web.</p>

<p>I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.</p>

<p>Now, all the things shown in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id14de1d18">Billion Triples post</a> can be done with a relational system specially built for each purpose. Since we are a general purpose <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1a457c70">RDBMS</a>, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.</p>

<p>Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, an RDB with mapping is the way to go. With Virtuoso, this can coexist fully with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples of this later.</p>

<p>The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.</p>

<p>We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of <a href="http://umbel.org/about/" id="link-id0x1aa5ea18">UMBEL</a> and <a href="http://dbpedia.org/resource/Cyc" id="link-id0x1a631a20">OpenCyc</a>. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.</p>

<p>We expect to be able to combine geography, social proximity, subject matter, and <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0x1aebdcc8">named entities</a>, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.</p>

<p>We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.</p>

<p>Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.</p>

<p>The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of <a href="http://dbpedia.org/resource/Information" id="link-id0x1ab88490">information</a> and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.</p>

<p>Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like <i>list the top 10 attributes with the most distinct values for all persons</i> cannot be done in SQL. SQL simply does not allow the columns to be variable.</p>
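
<p>To make the point about variable columns concrete, here is a minimal Python sketch (not Virtuoso code) of the "top attributes by distinct values" question over a handful of in-memory triples. The sample data, the <code>top_attributes</code> helper, and the choice of <code>foaf:Person</code> as the person class are all assumptions made for illustration; in practice one would presumably express this as a single SPARQL query with aggregates.</p>

```python
from collections import defaultdict

# Hypothetical names and sample triples (subject, predicate, object);
# none of this data comes from the post.
RDF_TYPE = "rdf:type"
PERSON = "foaf:Person"

triples = [
    ("p1", RDF_TYPE, PERSON),
    ("p2", RDF_TYPE, PERSON),
    ("p3", RDF_TYPE, PERSON),
    ("p1", "foaf:name", "Ann"),
    ("p2", "foaf:name", "Bob"),
    ("p3", "foaf:name", "Cid"),
    ("p1", "foaf:interest", "rdf"),
    ("p2", "foaf:interest", "rdf"),
    ("p3", "foaf:interest", "sql"),
    ("d1", "dc:title", "Not a person"),
]


def top_attributes(triples, cls, limit=10):
    """Rank the predicates used on instances of cls by distinct value count."""
    # Subjects explicitly typed as cls.
    members = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    # The predicate is ordinary data here, so it can serve as a grouping key.
    values = defaultdict(set)
    for s, p, o in triples:
        if s in members:
            values[p].add(o)
    # Most distinct values first; ties broken by predicate name.
    ranked = sorted(values.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(p, len(objs)) for p, objs in ranked[:limit]]


result = top_attributes(triples, PERSON)
```

<p>Because the predicate is just another value in each triple, no schema lookup or query generation is needed. With the sample data above, <code>result</code> ranks <code>foaf:name</code> first, with three distinct values.</p>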

<p>Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.</p>

<p>The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.</p>

<p>Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448">
  <rss:title>Virtuoso Update, Billion Triples and Outlook</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T09:31:17Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will say a few things about what we have been doing and where we can go. Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post. There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe. Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, is all there and proven to work. We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. We did along the way also run BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL. RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol. In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web. I will now talk about the complex warehouse/web-harvesting side. 
I will come to the mapping in another post. Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time. Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later. The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility. We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. 
Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case. We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface. We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person. Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible. The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity. Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. 
A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable. Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience. The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports. Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will say a few things about what we have been doing and where we can go.</p>

<p>Firstly, we have a fairly scalable platform with <a href="http://virtuoso.openlinksw.com" id="link-id0xa412e450">Virtuoso</a> 6 Cluster. It was most recently tested with the workload discussed in the previous <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id1638a5b8">Billion Triples post</a>.</p>

<p>There is an updated version of <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16280a68">the paper about this</a>. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.</p>

<p>Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more <a href="http://dbpedia.org/resource/SQL" id="link-id0x1c1c5f48">SQL</a> optimizations specific to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1bcb6108">RDF</a>. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.</p>

<p>We spent a lot of time around the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x3a4e17c8">Berlin SPARQL Benchmark</a> story, so we got to the more advanced stuff like the <a href="http://challenge.semanticweb.org/" id="link-id0x1a66c568">Billion Triples Challenge</a> rather late. Along the way, we also ran <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x188c2608">BSBM</a> with an <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1aa97f98">Oracle</a> back-end, with Virtuoso mapping <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1abd87a0">SPARQL</a> to SQL. This merits its own analysis in the near future, which will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for real-time lookups or for ETL.</p>

<p>RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1a941678">RDB2RDF XG</a>. Examples of complex warehouses include the <a href="http://neurocommons.org/page/Main_Page" id="link-id0x1aa5a9f8">Neurocommons</a> database, the Billion Triples Challenge, and the <a href="http://www.garlik.com/" id="link-id0x372df7b0">Garlik DataPatrol</a>.</p>

<p>In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1a671cf0">Linked Data</a> forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1ab83dd0">data</a> web becomes as indispensable as presence on the HTML web.</p>

<p>I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.</p>

<p>Now, all the things shown in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id14de1d18">Billion Triples post</a> can be done with a relational system specially built for each purpose. Since we are a general purpose <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x340d3470">RDBMS</a>, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.</p>

<p>Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, an RDB with mapping is the way to go. With Virtuoso, this can coexist fully with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples of this later.</p>

<p>The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.</p>

<p>We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of <a href="http://umbel.org/about/" id="link-id0x19cabf38">UMBEL</a> and <a href="http://dbpedia.org/resource/Cyc" id="link-id0x19cecd10">OpenCyc</a>. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.</p>

<p>We expect to be able to combine geography, social proximity, subject matter, and <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0x1a8202e8">named entities</a>, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.</p>

<p>We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.</p>

<p>Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.</p>

<p>The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of <a href="http://dbpedia.org/resource/Information" id="link-id0x1ab0a918">information</a> and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.</p>

<p>Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like <i>list the top 10 attributes with the most distinct values for all persons</i> cannot be done in SQL. SQL simply does not allow the columns to be variable.</p>
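
<p>To make the point about variable columns concrete, here is a minimal Python sketch (not Virtuoso code) of the "top attributes by distinct values" question over a handful of in-memory triples. The sample data, the <code>top_attributes</code> helper, and the choice of <code>foaf:Person</code> as the person class are all assumptions made for illustration; in practice one would presumably express this as a single SPARQL query with aggregates.</p>

```python
from collections import defaultdict

# Hypothetical names and sample triples (subject, predicate, object);
# none of this data comes from the post.
RDF_TYPE = "rdf:type"
PERSON = "foaf:Person"

triples = [
    ("p1", RDF_TYPE, PERSON),
    ("p2", RDF_TYPE, PERSON),
    ("p3", RDF_TYPE, PERSON),
    ("p1", "foaf:name", "Ann"),
    ("p2", "foaf:name", "Bob"),
    ("p3", "foaf:name", "Cid"),
    ("p1", "foaf:interest", "rdf"),
    ("p2", "foaf:interest", "rdf"),
    ("p3", "foaf:interest", "sql"),
    ("d1", "dc:title", "Not a person"),
]


def top_attributes(triples, cls, limit=10):
    """Rank the predicates used on instances of cls by distinct value count."""
    # Subjects explicitly typed as cls.
    members = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    # The predicate is ordinary data here, so it can serve as a grouping key.
    values = defaultdict(set)
    for s, p, o in triples:
        if s in members:
            values[p].add(o)
    # Most distinct values first; ties broken by predicate name.
    ranked = sorted(values.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(p, len(objs)) for p, objs in ranked[:limit]]


result = top_attributes(triples, PERSON)
```

<p>Because the predicate is just another value in each triple, no schema lookup or query generation is needed. With the sample data above, <code>result</code> ranks <code>foaf:name</code> first, with three distinct values.</p>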

<p>Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.</p>

<p>The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.</p>

<p>Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1435">
  <rss:title>Transitivity and Graphs for SQL</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:41:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Transitivity and Graphs for SQL Background I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language. The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead. It is now time to apply this principle to graph traversal. The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers. The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step. This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc. Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying. The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows definition of recursive views. 
The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children. The main problem with this is that I do not quite see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL&#39;s non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach. Take a question like &quot;list the parts of products of category C which have materials that are classified as toxic.&quot; Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure. Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; this is regular cost-based optimization. Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF. In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. 
Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation. SQL and Transitivity We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., SELECT in another SELECT&#39;s FROM clause, with a TRANSITIVE clause. Consider the data: CREATE TABLE &quot;knows&quot; (&quot;p1&quot; INT, &quot;p2&quot; INT, PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;) ); ALTER INDEX &quot;knows&quot; ON &quot;knows&quot; PARTITION (&quot;p1&quot; INT); CREATE INDEX &quot;knows2&quot; ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) PARTITION (&quot;p2&quot; INT); We represent a social network with the many-to-many relation &quot;knows&quot;. The persons are identified by integers. INSERT INTO &quot;knows&quot; VALUES (1, 2); INSERT INTO &quot;knows&quot; VALUES (1, 3); INSERT INTO &quot;knows&quot; VALUES (2, 4); SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p1&quot; = 1; We obtain the result: p1 p2 1 3 1 2 1 4 The operation is reversible: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1. 
To see what would happen if x knowing y also meant y knowing x, one could write: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM (SELECT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; UNION ALL SELECT &quot;p2&quot;, &quot;p1&quot; FROM &quot;knows&quot; ) &quot;k2&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 3 4 Now, since we know that 1 and 4 are related, we can ask how they are related. SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot;, T_STEP (1) AS &quot;via&quot;, T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, T_STEP (&#39;path_id&#39;) AS &quot;path&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 AND &quot;p2&quot; = 4; p1 p2 via step path 1 4 1 0 0 1 4 2 1 0 1 4 4 2 0 The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other. For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as: SELECT p2, dist, (SELECT COUNT (*) FROM &quot;knows&quot; &quot;c&quot; WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot; ) FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct &quot;p1&quot;, &quot;p2&quot;, t_step (&#39;step_no&#39;) AS &quot;dist&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 ORDER BY &quot;dist&quot;, 3 DESC; p2 dist aggregate 2 1 1 3 1 0 4 2 0 How? The queries shown above work on Virtuoso v6. 
When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets. Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes. Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS. Next In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Transitivity and Graphs for SQL</div>
<h2>Background</h2> 

<p>I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the <a href="http://dbpedia.org/resource/SQL" id="link-id0xa1a18c58">SQL</a> query language.</p>

<p>The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example.  It is all about extracting the common features of applications and making these the features of a platform instead.</p>

<p>It is now time to apply this principle to graph traversal.</p>

<p>The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency-tolerant manner. Writing them as one would for memory-based <a href="http://dbpedia.org/resource/Data" id="link-id0xaf8c730">data</a> structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.</p>

<p>The ad-hoc nature and very large volume of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xae41ef0">RDF</a> data make this a timely question.  Up until now, the answer to this question has been to materialize any implied facts in RDF stores.  If <i>a</i> is part of <i>b</i>, and <i>b</i> is part of <i><a href="http://dbpedia.org/resource/C_(programming_language)" id="link-id0xac9d8790">c</a></i>, the implied fact that <i>a</i> is part of <i>c</i> is inserted explicitly into the database as a pre-query step.</p>

<p>This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query.  The activity becomes less ad-hoc.</p>

<p>Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be intermittently included in or excluded from the set being analyzed.  This is why with <a href="http://virtuoso.openlinksw.com" id="link-id0xb68f9d0">Virtuoso</a> we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying.</p>

<p>The SQL world has taken steps towards dealing with recursion with the <code>WITH - UNION</code> construct, which allows the definition of recursive views.  The idea there is to define, for example, a tree walk as a <code>UNION</code> of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children.</p>
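
<p>As a rough sketch of that construct, written in SQL:1999 style as accepted by, e.g., PostgreSQL or SQLite and not as Virtuoso syntax, a root-to-leaf walk could look like the following. The <code>&quot;part_tree&quot;</code> table, its columns, the <code>&quot;walk&quot;</code> name, and the starting ID 1 are assumptions made only for illustration; some systems omit the <code>RECURSIVE</code> keyword.</p>

<blockquote>
 <pre><code>-- Hypothetical table: each row links a part to its parent part.
CREATE TABLE &quot;part_tree&quot; 
   (&quot;id&quot; INT PRIMARY KEY, 
    &quot;parent&quot; INT
   );

WITH RECURSIVE &quot;walk&quot; (&quot;id&quot;, &quot;depth&quot;) AS 
   (-- Anchor member: the starting node itself.
    SELECT &quot;id&quot;, 0 
       FROM &quot;part_tree&quot; 
       WHERE &quot;id&quot; = 1
    UNION ALL
    -- Recursive member: immediate children of every node reached so far.
    SELECT &quot;c&quot;.&quot;id&quot;, &quot;w&quot;.&quot;depth&quot; + 1
       FROM &quot;part_tree&quot; &quot;c&quot;
       JOIN &quot;walk&quot; &quot;w&quot; ON &quot;c&quot;.&quot;parent&quot; = &quot;w&quot;.&quot;id&quot;
   )
SELECT &quot;id&quot;, &quot;depth&quot; 
   FROM &quot;walk&quot;;</code>
 </pre></blockquote>

<p>Note that the parent-to-child direction is fixed by the join condition in the recursive member.</p>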

<p>The main problem with this is that I do not see how an SQL optimizer could effectively rearrange queries involving <code>JOIN</code>s between such recursive views.  This model of recursion seems to lose SQL&#39;s non-procedural nature.  One can no longer easily rearrange <code>JOIN</code>s based on what data is given and what is to be retrieved.  If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root.  At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.</p>

<p>Take a question like &quot;list the parts of products of category <i>C</i> which have materials that are classified as toxic.&quot;  Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure.</p>

<p>Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category.  One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; this is regular cost-based optimization.</p>

<p>Especially with RDF, there are many problems of this type.  In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.</p>

<p>In Virtuoso, we see <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb3bdcc0">SPARQL</a> as reducing to SQL.  Any RDF-oriented database-engine or query-optimization feature is accessed via SQL.  Thus, if we address run-time recursion in the Virtuoso query engine, this becomes, <i>ipso facto</i>, an SQL feature.  Besides, SQL is a much more mature and expressive language than the current SPARQL recommendation.</p>

<h2> SQL and Transitivity </h2>

<p>Here we will look at some simple social network queries.  A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., a <code>SELECT</code> in another <code>SELECT</code>&#39;s <code>FROM</code> clause, with a <code>TRANSITIVE</code> clause.</p>

<p>Consider the data:</p>

<blockquote>
 <pre><code>CREATE TABLE &quot;knows&quot; 
   (&quot;p1&quot; INT, 
    &quot;p2&quot; INT, 
    PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;)
   );
ALTER INDEX &quot;knows&quot; 
   ON &quot;knows&quot; 
   PARTITION (&quot;p1&quot; INT);
CREATE INDEX &quot;knows2&quot; 
   ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) 
   PARTITION (&quot;p2&quot; INT);
</code>
 </pre></blockquote>

<p>We represent a social network with the many-to-many relation &quot;knows&quot;.  The persons are identified by integers.</p>

<blockquote>
 <pre><code>INSERT INTO &quot;knows&quot; VALUES (1, 2);
INSERT INTO &quot;knows&quot; VALUES (1, 3);
INSERT INTO &quot;knows&quot; VALUES (2, 4);</code>
 </pre>

<pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p1&quot; = 1;</code></pre></blockquote>

<p>We obtain the result:</p>

<blockquote>
<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">3</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">2</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>The operation is reversible:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;
</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>Since we now give <i>p2</i>, we traverse from <i>p2</i> towards <i>p1</i>. The result set states that 4 is known by 2 directly and, since 2 is known by 1, by 1 transitively.</p>
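
<p>For comparison, here is a sketch of the same leaf-to-root lookup in the recursive-view style (SQL:1999 <code>WITH RECURSIVE</code>, as accepted by, e.g., PostgreSQL or SQLite; this is an illustration and not Virtuoso syntax). The name <code>&quot;known_by&quot;</code> is hypothetical. Here the direction of traversal is written into the query text itself:</p>

<blockquote>
 <pre><code>WITH RECURSIVE &quot;known_by&quot; (&quot;p1&quot;, &quot;p2&quot;) AS 
   (-- Anchor member: everyone who knows person 4 directly.
    SELECT &quot;p1&quot;, &quot;p2&quot; 
       FROM &quot;knows&quot; 
       WHERE &quot;p2&quot; = 4
    UNION
    -- Recursive member: step backwards from each p1 found so far.
    -- UNION (not UNION ALL) drops duplicates, so cycles terminate.
    SELECT &quot;k&quot;.&quot;p1&quot;, &quot;kb&quot;.&quot;p2&quot;
       FROM &quot;knows&quot; &quot;k&quot;
       JOIN &quot;known_by&quot; &quot;kb&quot; ON &quot;k&quot;.&quot;p2&quot; = &quot;kb&quot;.&quot;p1&quot;
   )
SELECT &quot;p1&quot;, &quot;p2&quot; 
   FROM &quot;known_by&quot;;</code>
 </pre></blockquote>

<p>Starting from <i>p1</i> instead would need a differently written anchor and join condition, whereas with the <code>TRANSITIVE</code> derived table only the <code>WHERE</code> clause changes.</p>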

<p>To see what would happen if <i>x</i> knowing <i>y</i> also meant <i>y</i> knowing <i>x</i>, one could write:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM (SELECT 
                  &quot;p1&quot;, 
                  &quot;p2&quot; 
               FROM &quot;knows&quot; 
               UNION ALL 
                  SELECT 
                     &quot;p2&quot;, 
                     &quot;p1&quot; 
                  FROM &quot;knows&quot;
              ) &quot;k2&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>


<p>Now, since we know that 1 and 4 are related, we can ask how they are related.</p>
<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (1) AS &quot;via&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, 
            T_STEP (&#39;path_id&#39;) AS &quot;path&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
      AND &quot;p2&quot; = 4;</code>
 </pre>

<table width="250">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">via</th>
    <th align="center" width="50">step</th>
    <th align="center" width="50">path</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">1</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>


<p>The first two columns are the ends of the path.  The next column is the person that is a step on the path.  The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the condition on the column designated as input, i.e., <i>p1</i>, has number 0.  Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.</p>
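
<p>As a small variation on the query above, and assuming the same Virtuoso v6 syntax, the steps of each path could be listed in traversal order by sorting on the path and step columns. This is an untested sketch that reuses only the constructs already shown:</p>

<blockquote>
 <pre><code>-- List each path step by step: first by path number, then by step number.
SELECT &quot;via&quot;, 
      &quot;step&quot;, 
      &quot;path&quot; 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
            &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (1) AS &quot;via&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, 
            T_STEP (&#39;path_id&#39;) AS &quot;path&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
      AND &quot;p2&quot; = 4
   ORDER BY &quot;path&quot;, &quot;step&quot;;</code>
 </pre></blockquote>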

<p>For LinkedIn users, the query for friends ordered by distance and by descending friend count, which is the basis of most LinkedIn search result views, can be written as:</p>

<blockquote>
 <pre><code>SELECT &quot;p2&quot;, 
      &quot;dist&quot;, 
      (SELECT 
          COUNT (*) 
          FROM &quot;knows&quot; &quot;c&quot; 
          WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot;
      ) 
   FROM (SELECT 
            TRANSITIVE t_in (1) t_out (2) t_distinct &quot;p1&quot;, 
            &quot;p2&quot;, 
            t_step (&#39;step_no&#39;) AS &quot;dist&quot;
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
   ORDER BY &quot;dist&quot;, 3 DESC;</code>
 </pre>


<table width="150">
<tr>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">dist</th>
    <th align="center" width="50">aggregate</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">1</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>


<h2>How?</h2>

<p>The queries shown above work on Virtuoso v6.  When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant.  By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.</p>

<p>Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection.  Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.</p>

<p>Writing a generic database-driven graph traversal framework on the application side, say in Java over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xa8a9ef8">JDBC</a>, would easily take over a thousand lines of code. This is much more work than can be justified just for a one-off, ad-hoc query.  Besides, the traversal order in such a case could not be optimized by the DBMS.</p>

<h2>Next</h2> 

<p>In a future <a href="http://dbpedia.org/resource/Blog" id="link-id0xb526a40">blog</a> post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc.  There are lots of switches for controlling different parameters of the traversal.  This is just the beginning.  I will also give examples of the use of this in SPARQL.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-09-08#1433">
  <rss:title>Transitivity and Graphs for SQL</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:20:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Background I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language. The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead. It is now time to apply this principle to graph traversal. The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers. The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step. This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc. Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying. The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows definition of recursive views. The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children. 
The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL&#39;s non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach. Take a question like &quot;list the parts of products of category C which have materials that are classified as toxic.&quot; Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure. Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; this is regular cost-based optimization. Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF. In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. 
Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation. SQL and Transitivity We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., SELECT in another SELECT&#39;s FROM clause, with a TRANSITIVE clause. Consider the data: CREATE TABLE &quot;knows&quot; (&quot;p1&quot; INT, &quot;p2&quot; INT, PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;) ); ALTER INDEX &quot;knows&quot; ON &quot;knows&quot; PARTITION (&quot;p1&quot; INT); CREATE INDEX &quot;knows2&quot; ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) PARTITION (&quot;p2&quot; INT); We represent a social network with the many-to-many relation &quot;knows&quot;. The persons are identified by integers. INSERT INTO &quot;knows&quot; VALUES (1, 2); INSERT INTO &quot;knows&quot; VALUES (1, 3); INSERT INTO &quot;knows&quot; VALUES (2, 4); SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p1&quot; = 1; We obtain the result: p1 p2 1 3 1 2 1 4 The operation is reversible: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1. 
To see what would happen if x knowing y also meant y knowing x, one could write: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM (SELECT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; UNION ALL SELECT &quot;p2&quot;, &quot;p1&quot; FROM &quot;knows&quot; ) &quot;k2&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 3 4 Now, since we know that 1 and 4 are related, we can ask how they are related. SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot;, T_STEP (1) AS &quot;via&quot;, T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, T_STEP (&#39;path_id&#39;) AS &quot;path&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 AND &quot;p2&quot; = 4; p1 p2 via step path 1 4 1 0 0 1 4 2 1 0 1 4 4 2 0 The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other. For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as: SELECT p2, dist, (SELECT COUNT (*) FROM &quot;knows&quot; &quot;c&quot; WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot; ) FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct &quot;p1&quot;, &quot;p2&quot;, t_step (&#39;step_no&#39;) AS &quot;dist&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 ORDER BY &quot;dist&quot;, 3 DESC; p2 dist aggregate 2 1 1 3 1 0 4 2 0 How? The queries shown above work on Virtuoso v6. 
When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets. Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes. Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS. Next In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Background</h2> 

<p>I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the <a href="http://dbpedia.org/resource/SQL" id="link-id0xb1fe830">SQL</a> query language.</p>

<p>The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example.  It is all about extracting the common features of applications and making these the features of a platform instead.</p>

<p>It is now time to apply this principle to graph traversal.</p>

<p>The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency-tolerant manner. Writing them as one would for memory-based <a href="http://dbpedia.org/resource/Data" id="link-id0x1cb37218">data</a> structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.</p>

<p>The ad-hoc nature and very large volume of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e1850a0">RDF</a> data make this a timely question.  Up until now, the answer to this question has been to materialize any implied facts in RDF stores.  If <i>a</i> is part of <i>b</i>, and <i>b</i> is part of <i><a href="http://dbpedia.org/resource/C_(programming_language)" id="link-id0xa1a08d38">c</a></i>, the implied fact that <i>a</i> is part of <i>c</i> is inserted explicitly into the database as a pre-query step.</p>

<p>This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query.  The activity becomes less ad-hoc.</p>

<p>Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be intermittently included in or excluded from the set being analyzed.  This is why with <a href="http://virtuoso.openlinksw.com" id="link-id0xa51bd10">Virtuoso</a> we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying.</p>

<p>The SQL world has taken steps towards dealing with recursion with the <code>WITH - UNION</code> construct, which allows the definition of recursive views.  The idea there is to define, for example, a tree walk as a <code>UNION</code> of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children.</p>
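
<p>As a rough sketch of that construct, written in SQL:1999 style as accepted by, e.g., PostgreSQL or SQLite and not as Virtuoso syntax, a root-to-leaf walk could look like the following. The <code>&quot;part_tree&quot;</code> table, its columns, the <code>&quot;walk&quot;</code> name, and the starting ID 1 are assumptions made only for illustration; some systems omit the <code>RECURSIVE</code> keyword.</p>

<blockquote>
 <pre><code>-- Hypothetical table: each row links a part to its parent part.
CREATE TABLE &quot;part_tree&quot; 
   (&quot;id&quot; INT PRIMARY KEY, 
    &quot;parent&quot; INT
   );

WITH RECURSIVE &quot;walk&quot; (&quot;id&quot;, &quot;depth&quot;) AS 
   (-- Anchor member: the starting node itself.
    SELECT &quot;id&quot;, 0 
       FROM &quot;part_tree&quot; 
       WHERE &quot;id&quot; = 1
    UNION ALL
    -- Recursive member: immediate children of every node reached so far.
    SELECT &quot;c&quot;.&quot;id&quot;, &quot;w&quot;.&quot;depth&quot; + 1
       FROM &quot;part_tree&quot; &quot;c&quot;
       JOIN &quot;walk&quot; &quot;w&quot; ON &quot;c&quot;.&quot;parent&quot; = &quot;w&quot;.&quot;id&quot;
   )
SELECT &quot;id&quot;, &quot;depth&quot; 
   FROM &quot;walk&quot;;</code>
 </pre></blockquote>

<p>Note that the parent-to-child direction is fixed by the join condition in the recursive member.</p>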

<p>The main problem with this is that I do not see how an SQL optimizer could effectively rearrange queries involving <code>JOIN</code>s between such recursive views.  This model of recursion seems to lose SQL&#39;s non-procedural nature.  One can no longer easily rearrange <code>JOIN</code>s based on what data is given and what is to be retrieved.  If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root.  At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.</p>

<p>Take a question like &quot;list the parts of products of category <i>C</i> which have materials that are classified as toxic.&quot;  Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure.</p>

<p>Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category.  One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; this is regular cost-based optimization.</p>

<p>Especially with RDF, there are many problems of this type.  In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.</p>

<p>In Virtuoso, we see <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb4b3ce8">SPARQL</a> as reducing to SQL.  Any RDF-oriented database-engine or query-optimization feature is accessed via SQL.  Thus, if we address run-time recursion in the Virtuoso query engine, this becomes, <i>ipso facto</i>, an SQL feature.  Besides, SQL is a much more mature and expressive language than the current SPARQL recommendation.</p>

<h2> SQL and Transitivity </h2>

<p>Here we will look at some simple social network queries.  A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., a <code>SELECT</code> in another <code>SELECT</code>&#39;s <code>FROM</code> clause, with a <code>TRANSITIVE</code> clause.</p>

<p>Consider the data:</p>

<blockquote>
 <pre><code>CREATE TABLE &quot;knows&quot; 
   (&quot;p1&quot; INT, 
    &quot;p2&quot; INT, 
    PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;)
   );
ALTER INDEX &quot;knows&quot; 
   ON &quot;knows&quot; 
   PARTITION (&quot;p1&quot; INT);
CREATE INDEX &quot;knows2&quot; 
   ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) 
   PARTITION (&quot;p2&quot; INT);
</code>
 </pre></blockquote>

<p>We represent a social network with the many-to-many relation &quot;knows&quot;.  The persons are identified by integers.</p>

<blockquote>
 <pre><code>INSERT INTO &quot;knows&quot; VALUES (1, 2);
INSERT INTO &quot;knows&quot; VALUES (1, 3);
INSERT INTO &quot;knows&quot; VALUES (2, 4);</code>
 </pre>

<pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p1&quot; = 1;</code></pre></blockquote>

<p>We obtain the result:</p>

<blockquote>
<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">3</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">2</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>The operation is reversible:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;
</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>Since we now supply <i>p2</i>, the traversal runs from <i>p2</i> towards <i>p1</i>. The result set states that 4 is known by 2 directly and, because 2 is known by 1, transitively by 1.</p>

<p>To see what would happen if <i>x</i> knowing <i>y</i> also meant <i>y</i> knowing <i>x</i>, one could write:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM (SELECT 
                  &quot;p1&quot;, 
                  &quot;p2&quot; 
               FROM &quot;knows&quot; 
               UNION ALL 
                  SELECT 
                     &quot;p2&quot;, 
                     &quot;p1&quot; 
                  FROM &quot;knows&quot;
              ) &quot;k2&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>


<p>Now, since we know that 1 and 4 are related, we can ask how they are related.</p>
<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (1) AS &quot;via&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, 
            T_STEP (&#39;path_id&#39;) AS &quot;path&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
      AND &quot;p2&quot; = 4;</code>
 </pre>

<table width="250">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">via</th>
    <th align="center" width="50">step</th>
    <th align="center" width="50">path</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">1</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>


<p>The first two columns are the ends of the path.  The third column is the person at a given step on the path.  The fourth is the number of that step, counting from 0: step 0 is the end of the path that matches the condition on the column designated as input, i.e., <i>p1</i>.  Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.</p>
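
<p>To make the shape of these rows concrete, the same answer can be reproduced on the application side with a plain breadth-first search. This is only a minimal Python sketch over an in-memory copy of the three edges, not a description of how Virtuoso evaluates the query; <code>KNOWS</code>, <code>find_path</code> and <code>path_rows</code> are hypothetical names introduced for illustration, and only one shortest path is returned, so the path number is always 0.</p>

```python
from collections import deque

# Same three edges as the "knows" table: (p1, p2) means p1 knows p2.
KNOWS = [(1, 2), (1, 3), (2, 4)]


def find_path(edges, start, goal):
    """Breadth-first search from start to goal. Returns the persons on
    one shortest path, or None if goal cannot be reached."""
    neighbours = {}
    for p1, p2 in edges:
        neighbours.setdefault(p1, []).append(p2)

    parent = {start: None}  # records how each person was first reached
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            break
        for friend in neighbours.get(person, []):
            if friend not in parent:
                parent[friend] = person
                queue.append(friend)

    if goal not in parent:
        return None

    # Walk the parent links back from goal, then reverse the list.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path


def path_rows(edges, start, goal, path_id=0):
    """Rows shaped like the result table: (p1, p2, via, step, path)."""
    path = find_path(edges, start, goal)
    if path is None:
        return []
    return [(start, goal, via, step, path_id)
            for step, via in enumerate(path)]


for row in path_rows(KNOWS, 1, 4):
    print(row)
```

<p>For the sample data this prints the three rows of the table above, with steps 0, 1 and 2 passing through persons 1, 2 and 4.</p>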

<p>For LinkedIn users: the query that lists friends ordered by distance and then by descending friend count, which underlies most LinkedIn search result views, can be written as:</p>

<blockquote>
 <pre><code>SELECT &quot;p2&quot;, 
      &quot;dist&quot;, 
      (SELECT 
          COUNT (*) 
          FROM &quot;knows&quot; &quot;c&quot; 
          WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot;
      ) 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;dist&quot;
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
   ORDER BY &quot;dist&quot;, 3 DESC;</code>
 </pre>


<table width="150">
<tr>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">dist</th>
    <th align="center" width="50">aggregate</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">1</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>
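
<p>As a cross-check of the ordering, the same listing can be computed outside the database. The following is a minimal Python sketch under the assumption that the whole relation fits in memory; <code>KNOWS</code> and <code>friends_by_distance</code> are hypothetical names, the friend count is the number of outgoing &quot;knows&quot; rows as in the subquery above, and the starting person is left out of the result.</p>

```python
from collections import deque

# Same three edges as the "knows" table: (p1, p2) means p1 knows p2.
KNOWS = [(1, 2), (1, 3), (2, 4)]


def friends_by_distance(edges, start):
    """Return (person, dist, friend_count) for everyone reachable from
    start, ordered by distance and then by descending friend count."""
    neighbours = {}
    for p1, p2 in edges:
        neighbours.setdefault(p1, []).append(p2)

    # Breadth-first search: the first visit gives the shortest distance.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in neighbours.get(person, []):
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)

    rows = [(person, d, len(neighbours.get(person, [])))
            for person, d in dist.items() if person != start]
    # Ascending distance, then descending friend count.
    rows.sort(key=lambda row: (row[1], -row[2]))
    return rows


for row in friends_by_distance(KNOWS, 1):
    print(row)
```

<p>With the sample data the rows come out as (2, 1, 1), (3, 1, 0) and (4, 2, 0), matching the table.</p>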


<h2>How?</h2>

<p>The queries shown above work on Virtuoso v6.  When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant.  By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.</p>

<p>Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection.  Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.</p>
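
<p>The arithmetic and the idea of expanding from both ends can be sketched in a few lines of Python. This is an illustration only, assuming a symmetric in-memory contact map; it says nothing about how the Virtuoso optimizer actually schedules the traversal, and <code>nodes_reached</code>, <code>meet_in_the_middle</code> and <code>CONTACTS</code> are hypothetical names.</p>

```python
def nodes_reached(fanout, depth):
    """Estimate used in the text: number of nodes at the deepest level
    when every node has `fanout` contacts."""
    return fanout ** depth


one_sided = nodes_reached(30, 6)      # expand 6 levels from one end
two_sided = 2 * nodes_reached(30, 3)  # expand 3 levels from each end
print(one_sided, two_sided)           # 729000000 54000


def meet_in_the_middle(neighbours, a, b, max_depth=3):
    """Expand level by level from both a and b, at most max_depth levels
    each. Returns a person seen from both sides, or None.
    `neighbours` maps a person to contacts and is assumed symmetric."""
    seen_a, seen_b = {a}, {b}
    frontier_a, frontier_b = {a}, {b}
    depth = 0
    while True:
        common = seen_a.intersection(seen_b)
        if common:
            return min(common)  # deterministic choice among candidates
        if depth == max_depth:
            return None
        # Advance both frontiers by one level, skipping known persons.
        frontier_a = {n for p in frontier_a
                      for n in neighbours.get(p, ())} - seen_a
        frontier_b = {n for p in frontier_b
                      for n in neighbours.get(p, ())} - seen_b
        seen_a |= frontier_a
        seen_b |= frontier_b
        depth += 1


# The sample "knows" edges, read in both directions.
CONTACTS = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
print(meet_in_the_middle(CONTACTS, 3, 4))
```

<p>Each side only ever expands to half the total path length, which is where the saving from 30^6 to 2 * 30^3 comes from.</p>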

<p>Writing a generic, database-driven graph traversal framework on the application side, say in Java over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xb595050">JDBC</a>, would easily take over a thousand lines. This is much more work than can be justified for a one-off, ad-hoc query.  Besides, the traversal order in such a case could not be optimized by the DBMS.</p>

<h2>Next</h2> 

<p>In a future <a href="http://dbpedia.org/resource/Blog" id="link-id0x1e4d4f18">blog</a> post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc.  There are lots of switches for controlling different parameters of the traversal.  This is just the beginning.  I will also give examples of the use of this in SPARQL.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-09-05#1430">
  <rss:title>Linked Data, Ubiquity Commands, and Resource Descriptions (Update 3)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-05T05:43:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Ubiquity from Mozilla Labs provides an alternative entry point for experiencing the &quot;Controller&quot; aspect of the Web&#39;s natural compatibility with the MVC development pattern. As I&#39;ve noted (in various posts), Web Services, as practiced by the REST-oriented Web 2.0 community or the SOAP-oriented SOA community within the enterprise, are fundamentally about the &quot;Controller&quot; aspect of MVC. Ubiquity provides a command-line interface for direct invocation of Web Services. For instance, in our case, we can expose Virtuoso&#39;s built-in RDF Middleware (&quot;Sponger&quot;) and Linked Data deployment services via a single command of the form: describe-resource &lt;url&gt; To experience this neat addition to Firefox you need to do the following: (1) Download and install the Ubiquity Extension for Firefox. (2) Subscribe to the OpenLink Command for Resource Description. (3) Press CTRL+Space (Windows / Linux) or Option+Space (Mac OS X). (4) Type in: describe-resource &lt;a-web-resource-url&gt; How to unsubscribe: At the current time, you need to do this if you&#39;ve installed commands using Ubiquity 0.1.0 and want to use newer versions of the same commands after upgrading to Ubiquity 0.1.1. (1) Type &quot;about:ubiquity&quot; into the browser&#39;s address bar. (2) Click on the unsubscribe links associated with your command subscription list. Enjoy!</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div><p><a href="http://labs.mozilla.com/2008/08/introducing-ubiquity/" id="link-id11258ea0">Ubiquity</a> from 
	<a href="http://labs.mozilla.com/" id="link-id112ebe28">Mozilla Labs</a> 
	provides an alternative entry point for experiencing the "Controller" aspect of the 
	<a href="http://dbpedia.org/resource/World_Wide_Web" id="link-id0xa0d2ccd0">Web</a>'s natural compatibility with the 
	<a href="http://dbpedia.org/resource/Model-view-controller" id="link-id10ec1a08">MVC</a> development pattern. As I've noted (in <a href="http://myopenlink.net/weblog/public/search.vspx?blogid=kidehen-blog-0&amp;q=mvc&amp;type=text&amp;output=html" id="link-id15390f28">various posts</a>), <a href="http://dbpedia.org/resource/World_Wide_Web">Web</a> Services, as practiced by the REST-oriented Web 2.0 community or the SOAP-oriented SOA community within the enterprise, are fundamentally about the "Controller" aspect of <a href="http://dbpedia.org/resource/Model-view-controller" id="link-id13c0d758">MVC</a>. </p><p>Ubiquity provides a command-line interface for direct invocation of Web Services. For instance, in our case, we can expose <a href="http://virtuoso.openlinksw.com" id="link-id10b04708">Virtuoso</a>'s built-in <a href="http://www.openlinksw.com/weblog/public/search.vspx?blogid=127&amp;q=rdf%20middleware&amp;type=text&amp;output=html" id="link-id1113ae38">RDF Middleware</a> ("Sponger") and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id1457b3b8">Linked Data</a> deployment services via a single command of the form: describe-resource &lt;url&gt; </p><p>To experience this neat addition to Firefox you need to do the following:</p><ol><li><a href="https://people.mozilla.com/%7Eavarma/ubiquity-0.1.1.xpi" id="link-id13b15e88">Download</a> and install the Ubiquity Extension for Firefox</li><li><a href="http://demo.openlinksw.com/ubiq" id="link-id10e85880">Subscribe</a> to the OpenLink Command for Resource Description</li><li>Press CTRL+Space (Windows / Linux) or Option+Space (Mac OS X)</li><li>Type in: describe-resource &lt;a-web-resource-url&gt;  </li></ol><h3>How to unsubscribe</h3> <p>At the current time, you need to do this if you've installed commands using Ubiquity 0.1.0 and want to use newer versions of the same commands after upgrading to Ubiquity 0.1.1.</p>
<ol><li>Type "about:ubiquity" into the browser's address bar</li><li>Click on the unsubscribe links associated with your command subscription list</li></ol> <p>Enjoy!</p></div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-08-28#1425">
  <rss:title>The Essence of the Matter re. Information Overload</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-28T12:17:55Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The title of this post is an expression of my gut reaction to the quotes below, which originate from Leo Sauermann&#39;s post about the Nepomuk Semantic Desktop for KDE: Ansgar Bernardi, deputy head of the Knowledge Management Department at Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI, or the German Research Center for Artificial Intelligence) and Nepomuk&#39;s coordinator, explains, &quot;The basic problem that we all face nowadays is how to handle vast amounts of information at a sensible rate.&quot; According to Bernardi, Nepomuk takes a traditional approach by creating a meta-data layer with well-defined elements that services can be built upon to create and manipulate the information. The comment above echoes my sentiments about the imminence of &quot;information overload&quot; due to the vast amounts of user generated content on the Internet as a whole. We are going to need to process more and more data within a fixed 24 hour timeframe, while attempting to balance our professional and personal lives. Rest assured, this is a very serious issue, and you cannot even begin to address it without a Web of Linked Data. &quot;The first idea of building the semantic desktop arose from the fact that one of our colleagues could not remember the girlfriends of his friends,&quot; Bernardi says, more than half-seriously. &quot;Because they kept changing -- you know how it is. The point is, you have a vast amount of information on your desktop, hidden in files, hidden in emails, hidden in the names and structures of your folders. Nepomuk gives a standard way to handle such information.&quot; If you get a personal URI for Entity &quot;You&quot;, via a Linked Data aware platform (e.g. OpenLink Data Spaces) that virtualizes data across your existing Web data spaces (blogs, feed subscriptions, wikis, shared bookmarks, photo galleries, calendars, etc.), you then only have to remember your URI whenever you need to &quot;Find&quot; something. Imagine that! To conclude, &quot;information overload&quot; is the imminent challenge of our time, and the keys to alleviating it lie in our ability to construct and maintain (via solutions) a few context lenses (URIs) that provide coherent conduits into the dense mesh of structured Linked Data on the Web.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The title of this post is an expression of my gut reaction to the quotes below, which originate from <a href="http://leobard.twoday.net/" id="link-id104b2308">Leo Sauermann</a>&#39;s post about the <a href="http://leobard.twoday.net/stories/5151765/" id="link-id1889d5d8">Nepomuk Semantic Desktop for KDE</a>:</p>

<blockquote>
<cite><strong>Ansgar Bernardi</strong>, deputy head of the <a href="http://dbpedia.org/resource/Knowledge" id="link-id16d79970">Knowledge</a> Management Department at Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI, or the German Research Center for Artificial Intelligence) and Nepomuk&#39;s coordinator, explains, &quot;The basic problem that we all face nowadays is how to handle vast amounts of <a href="http://dbpedia.org/resource/Information" id="link-id13a01b58">information</a> at a sensible rate.&quot; According to Bernardi, Nepomuk takes a traditional approach by creating a meta-<a href="http://dbpedia.org/resource/Data">data</a> layer with well-defined elements that services can be built upon to create and manipulate the <a href="http://dbpedia.org/resource/Information" id="link-id102433e8">information</a>.</cite>
</blockquote>
<p>
The comment above echoes my sentiments about the imminence of &quot;<a href="http://dbpedia.org/resource/Information" id="link-id0x10dd6c20">information</a> overload&quot; due to the vast amounts of user generated content on the <a href="http://dbpedia.org/resource/Internet" id="link-id139926b0">Internet</a> as a whole. We are going to need to process more and more data within a fixed 24 hour timeframe, while attempting to balance our professional and personal lives. Rest assured, this is a very serious issue, and you cannot even begin to address it without a <a href="http://dbpedia.org/resource/World_Wide_Web">Web</a> of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id188ebc20">Linked Data</a>.</p>

<blockquote>
<cite>&quot;The first idea of building the semantic desktop arose from the fact that one of our colleagues could not remember the girlfriends of his friends,&quot; Bernardi says, more than half-seriously. &quot;Because they kept changing -- you know how it is. The point is, you have a vast amount of <a href="http://dbpedia.org/resource/Information">information</a> on your desktop, hidden in files, hidden in emails, hidden in the names and structures of your folders. Nepomuk gives a standard way to handle such information.&quot;</cite>
</blockquote>

<p>If you get a personal <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id171dd2e0">URI</a> for <a href="http://dbpedia.org/resource/Entity" id="link-id18294318">Entity</a> &quot;You&quot;, via a <a href="http://dbpedia.org/resource/Linked_Data" id="link-id188a1b10">Linked Data</a> aware platform (e.g. <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id167ad840">OpenLink Data Spaces</a>) that virtualizes data across your existing Web data spaces (blogs, feed subscriptions, wikis, shared bookmarks, photo galleries, calendars, etc.), you then only have to remember your <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id171c3ef0">URI</a> whenever you need to &quot;Find&quot; something. Imagine that!</p> 

<p>To conclude, &quot;information overload&quot; is the imminent challenge of our time, and the keys to alleviating it lie in our ability to construct and maintain (via solutions) a few <a href="http://dbpedia.org/resource/Context_%28language_use%29" id="link-id1074ade0">context</a> lenses (URIs) that provide coherent conduits into the dense mesh of structured <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0xd30b090">Linked Data</a> on the Web. </p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-07-17#1393">
  <rss:title>Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-07-17T17:18:09Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs Improvements Full operation with Jena and Sesame RDF Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites. Better support for alternate RDF indexing schemes Parallel operation of the RDF Sponger, importing multiple sources concurrently. New data formats supported for on-demand RDF-ization in the Sponger More efficient support for inference of subclass and sub-property; now capable of efficiently handling taxonomies of tens of thousands of classes OWL equivalentClass and equivalentProperty support. Dynamic IRI host part support for mapped data and for metadata of local resources. Renaming the host or using multiple virtual hosts will accept URIs with the right host part and refer to the same thing, no duplicate storage required. SPARQL optimizations for LIMIT and OFFSET Documentation How to read query plans and how to use the key performance meters How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case How to debug RDF views Better documentation of SPARQL extensions and options A sample of correct RDF view usage with the Northwind demo data Bug Fixes Generally improved safety of built-in functions, better argument checking. Verified UTF8 international character support in all RDF use cases, SQL client/SPARQL protocol/all data formats.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs</div>
<h2>Improvements</h2>
<ul>
<li>
  <a href="http://docs.openlinksw.com:80/virtuoso/rdfnativestorageproviders.html" id="link-id13e54d98">Full operation</a> with <a href="http://jena.sourceforge.net/" id="link-id0x11a3d360">Jena</a> and <a href="http://sourceforge.net/projects/sesame/" id="link-id0x1108d428">Sesame</a> <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1288aa00">RDF</a> Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites.</li>
<li>Better support for alternate RDF indexing schemes</li>
<li>Parallel operation of the RDF Sponger, importing multiple
sources concurrently.</li>
<li>New <a href="http://dbpedia.org/resource/Data" id="link-id0x128a9810">data</a> formats supported for on-demand RDF-ization in the
Sponger</li>
<li>More efficient support for inference of subclass and
sub-property; now capable of efficiently handling taxonomies of tens
of thousands of classes</li>
<li>
    <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x6af0678">OWL</a> <a href="http://docs.openlinksw.com:80/virtuoso/rdfsparqlrule.html#rdfsparqlruleintro" id="link-id104d58d8">equivalentClass and equivalentProperty</a> support.</li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfdatarepresentation.html#rdfdynamiclocal" id="link-id109606a8">Dynamic IRI host part</a> support for mapped data and for metadata of local resources. Renaming the host or using multiple virtual hosts will accept URIs with the right host part and refer to the same thing, no duplicate storage required.</li>
<li>
    <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x12e0cc38">SPARQL</a> optimizations for <code>LIMIT</code> and <code>OFFSET</code>
</li>
</ul>
<h2>Documentation</h2>
<ul>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/perfdiag.html#perfdiagqueryplans" id="link-id10a56dd0">How to read query plans and how to use the key performance meters</a>
  </li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfperformancetuning.html#rdfperfcost" id="link-id106cb5c0">How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case</a>
  </li>
<li>How to debug RDF views
<ul>
  <li>
    <a href="http://docs.openlinksw.com:80/virtuoso/sparqldebug.html" id="link-id133b4420">Better documentation of SPARQL extensions and options</a>
  </li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfviews.html#rdfviewnorthwindexample1" id="link-id1060fdd8">A sample of correct RDF view usage with the Northwind demo data</a>
  </li>
</ul>
</li>
</ul>
<h2>Bug Fixes</h2>
<ul>
<li>Generally improved safety of built-in functions, better
argument checking.</li>
<li>Verified UTF8 international character support in all RDF use
cases, <a href="http://dbpedia.org/resource/SQL" id="link-id0x12839fd0">SQL</a> client/<a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1288f350">SPARQL protocol</a>/all data formats.</li>
</ul>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-07-17#1392">
  <rss:title>Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-07-17T17:16:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Improvements Full operation with Jena and Sesame RDF Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites. Better support for alternate RDF indexing schemes Parallel operation of the RDF Sponger, importing multiple sources concurrently. New data formats supported for on-demand RDF-ization in the Sponger More efficient support for inference of subclass and sub-property; now capable of efficiently handling taxonomies of tens of thousands of classes OWL equivalentClass and equivalentProperty support. Dynamic IRI host part support for mapped data and for metadata of local resources. Renaming the host or using multiple virtual hosts will accept URIs with the right host part and refer to the same thing, no duplicate storage required. SPARQL optimizations for LIMIT and OFFSET Documentation How to read query plans and how to use the key performance meters How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case How to debug RDF views Better documentation of SPARQL extensions and options A sample of correct RDF view usage with the Northwind demo data Bug Fixes Generally improved safety of built-in functions, better argument checking. Verified UTF8 international character support in all RDF use cases, SQL client/SPARQL protocol/all data formats.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Improvements</h2>
<ul>
<li>
  <a href="http://docs.openlinksw.com:80/virtuoso/rdfnativestorageproviders.html" id="link-id13e54d98">Full operation</a> with <a href="http://jena.sourceforge.net/" id="link-id0x11839970">Jena</a> and <a href="http://sourceforge.net/projects/sesame/" id="link-id0x118521a0">Sesame</a> <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x11e14758">RDF</a> Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites.</li>
<li>Better support for alternate RDF indexing schemes</li>
<li>Parallel operation of the RDF Sponger, importing multiple
sources concurrently.</li>
<li>New <a href="http://dbpedia.org/resource/Data" id="link-id0x13661868">data</a> formats supported for on-demand RDF-ization in the
Sponger</li>
<li>More efficient support for inference of subclass and
sub-property; now capable of efficiently handling taxonomies of tens
of thousands of classes</li>
<li>
  <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x9df079b8">OWL</a> <a href="http://docs.openlinksw.com:80/virtuoso/rdfsparqlrule.html#rdfsparqlruleintro" id="link-id104d58d8">equivalentClass and equivalentProperty</a> support.</li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfdatarepresentation.html#rdfdynamiclocal" id="link-id109606a8">Dynamic IRI host part</a> support for mapped data and for metadata of local resources. Renaming the host or using multiple virtual hosts will accept URIs with the right host part and refer to the same thing, no duplicate storage required.</li>
<li>
  <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x110e7688">SPARQL</a> optimizations for <code>LIMIT</code> and <code>OFFSET</code>
</li>
</ul>
<h2>Documentation</h2>
<ul>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/perfdiag.html#perfdiagqueryplans" id="link-id10a56dd0">How to read query plans and how to use the key performance meters</a>
  </li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfperformancetuning.html#rdfperfcost" id="link-id106cb5c0">How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case</a>
  </li>
<li>How to debug RDF views
<ul>
  <li>
    <a href="http://docs.openlinksw.com:80/virtuoso/sparqldebug.html" id="link-id133b4420">Better documentation of SPARQL extensions and options</a>
  </li>
<li>
    <a href="http://docs.openlinksw.com:80/virtuoso/rdfviews.html#rdfviewnorthwindexample1" id="link-id1060fdd8">A sample of correct RDF view usage with the Northwind demo data</a>
  </li>
</ul>
</li>
</ul>
<h2>Bug Fixes</h2>
<ul>
<li>Generally improved safety of built-in functions, better
argument checking.</li>
<li>Verified UTF8 international character support in all RDF use
cases, <a href="http://dbpedia.org/resource/SQL" id="link-id0x11140c28">SQL</a> client/<a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x110947e8">SPARQL protocol</a>/all data formats.</li>
</ul>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1381">
  <rss:title>The DARQ Matter of Federation</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The DARQ Matter of Federation Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space. For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace. To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out. DARQ is Bastian Quilitz&#39;s federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly. Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though. Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads. Since we are convinced of the cause, let&#39;s talk about the means. For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. 
If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server. Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested. These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day. When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist: If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus {?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h } does not necessarily denote a colocated join if many servers serve items of the vocabulary. However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. 
If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names. For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these. For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually. We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see. Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well-described SPARQL end-points. But up to a certain volume it has the speed of local storage. The emergence of voiD (Vocabulary of Interlinked Data) is a step in the direction of making federation a reality. There is a separate post about this.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">The DARQ Matter of Federation</div>
<p>Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space.</p>
<p>For the <a href="http://dbpedia.org/resource/Data" id="link-id0x19dbf410">data</a> web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in <a href="http://dbpedia.org/resource/Cyberspace" id="link-id0x9fc13ff8">cyberspace</a>.</p>
<p>To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.</p>
<p>
  <a href="http://darq.sourceforge.net/" id="link-id0x1d06bd88">DARQ</a> is Bastian Quilitz&#39;s federated extension of the <a href="http://jena.sourceforge.net/" id="link-id0x1cf28f70">Jena</a> <a href="http://jena.sourceforge.net/ARQ/" id="link-id0x1cba22c8">ARQ</a> <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x171c7dc8">SPARQL</a> processor. It has existed for a while and was also presented at <a href="http://www.eswc2008.org/" id="link-id0x1ed53cd0">ESWC2008</a>. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.</p>
<p>Bastian had split <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1ce846c0">DBpedia</a> among five <a href="http://virtuoso.openlinksw.com" id="link-id0x1cad0640">Virtuoso</a> servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.</p>
<p>Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.</p>
<p>Since we are convinced of the cause, let&#39;s talk about the means.</p>
<p>For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x19a48280">HTTP</a>/1.1 message. So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server.</p>
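<p>As a rough illustration of the batching idea, the sketch below folds all the bindings from the outer join step into a single request instead of sending one request per friend. It is only one way of getting the effect: it assumes the end point accepts the <code>VALUES</code> clause of SPARQL 1.1, and the names <code>build_batched_query</code>, <code>dbr:</code> and <code>dbo:populationTotal</code> are illustrative assumptions, not something taken from DARQ. Prefix declarations are left out of the query text.</p>

```python
def build_batched_query(cities, min_population=1000000):
    """Build one SPARQL query covering every city from the outer join step."""
    # One VALUES row per binding produced by the outer loop of the join.
    # The cities are prefixed names; PREFIX declarations are omitted here.
    rows = " ".join("({})".format(city) for city in cities)
    return (
        "SELECT ?city WHERE { "
        "VALUES (?city) { " + rows + " } "
        "?city dbo:populationTotal ?pop . "
        "FILTER (?pop > " + str(min_population) + ") }"
    )


# Hypothetical bindings: the cities where my friends live.
cities = ["dbr:Helsinki", "dbr:London", "dbr:Tokyo"]
query = build_batched_query(cities)
print(query)
```

<p>With 100 friends this is one round trip per eligible server rather than 100, which is where the latency saving comes from.</p>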
<p>Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1cf18278">JDBC</a> so Bastian can try this if interested.</p>
<p>These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.</p>
<p>When federating <a href="http://dbpedia.org/resource/SQL" id="link-id0x1cf7d0e8">SQL</a>, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:</p>
<p>If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus <code>{?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h }</code> does not necessarily denote a colocated join if many servers serve items of the vocabulary.</p>
<p>However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names.</p>
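<p>The try-colocated-first, partition-on-miss strategy can be sketched as follows. This is a toy model under stated assumptions, not Virtuoso or DARQ code: each server is a plain list of (subject, predicate, object) tuples, and the function names <code>match</code>, <code>colocated</code>, <code>partitioned</code> and <code>federated</code> are made up for the illustration.</p>

```python
def match(triples, s, p, o):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def colocated(servers, name):
    """Evaluate the whole three-pattern fragment on each server by itself."""
    results = []
    for triples in servers:
        for person, _, _ in match(triples, None, "name", name):
            for _, _, geek in match(triples, person, "geekCode", None):
                for _, _, mail in match(triples, person, "emailHash", None):
                    results.append((person, geek, mail))
    return results


def partitioned(servers, name):
    """Evaluate each join step against all servers and join at the federator.

    Pooling the triples stands in for sending every pattern to every server.
    """
    everything = [t for triples in servers for t in triples]
    return colocated([everything], name)


def federated(servers, name):
    """Assume colocation first; fall back to partitioning if nothing is found."""
    return colocated(servers, name) or partitioned(servers, name)


server_a = [("p1", "name", "Johnny"), ("p1", "geekCode", "GCS"),
            ("p1", "emailHash", "abc")]
server_b = [("p2", "name", "Johnny")]
server_c = [("p2", "geekCode", "GE"), ("p2", "emailHash", "def")]

fast_answer = federated([server_a, server_b, server_c], "Johnny")
fallback_answer = federated([server_b, server_c], "Johnny")
print(fast_answer, fallback_answer)
```

<p>In the first call the colocated pass finds <code>p1</code> and stops, so <code>p2</code>, whose data is split over two servers, is missed. That is the less-than-total recall mentioned above. In the second call nothing is colocated, so the partitioned pass runs and finds <code>p2</code>.</p>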
<p>For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.</p>
<p>For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.</p>
<p>We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see.</p>
<p>Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1e163140">Sponger</a>. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.</p>
<p>The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id1109a4c8">a separate post</a> about this.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1380">
  <rss:title>Aspects of RDF to RDF Mapping</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:18Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Aspects of RDF to RDF Mapping The W3C has recently launched an incubator group about mapping relational data to RDF. From participating in the group for the few initial sessions, I get the following impressions. There is a segment of users, for example from the biomedical community, who do heavy duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest. There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as Triplify and semantic web search such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all. (Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.) Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web. As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume. Where do we stand? We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing. We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist. These things are somewhat different. 
At present, both the research warehousers and the web scalers predominantly go for ETL. This is fine by us as we definitely are in the large RDF store race. Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in SPARQL, once the obviously necessary extensions of sub-query, expressions and aggregation are in place. To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. &quot;Such a big server farm that we&#39;d have to put it on our map,&quot; as Jenny Harding put it at ESWC2008. I&#39;d add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets. So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query. You can download the slide set I used for the Virtuoso presentation for the RDB to RDF mapping incubator group (PPT; other formats coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the information architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. 
Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble and non-trivial question. We will show the basics of this with the TPC-H mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings between keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (PDF; other formats coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing SQL that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an IN sub-query instead of the equivalent existence test. Follow this blog for more on the topic; published papers are always a limited view on the matter.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Aspects of RDF to RDF Mapping</div>
<p>The W3C has recently launched an <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-idd763f48">incubator group about mapping relational data to RDF</a>.</p>
<p>From participating in the group for the few initial sessions, I get the following impressions.</p>
<p>There is a segment of users, for example from the biomedical community, who do heavy duty <a href="http://dbpedia.org/resource/Data" id="link-id0x17f9e6f8">data</a> integration and look to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x17eabf48">RDF</a> for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest.</p>
<p>There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as <a href="http://triplify.org/" id="link-id0x1a25cd28">Triplify</a> and <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x182c41e8">semantic web</a> search such as <a href="http://sindice.org/" id="link-id0x1a29c5e8">Sindice</a>. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.</p>
<p>(Some cleanup is taking place with <a href="http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/" id="link-id0x17fd2b70">Yago</a> and <a href="http://zitgist.com/about/" id="link-id0x17e6ab88">Zitgist</a>, but this is a matter for a different post.)</p>
<p>Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.</p>
<p>As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.</p>
<p>Where do we stand?</p>
<p>We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x182acd68">RDBMS</a>(s) on demand. Projects in this direction are ongoing.</p>
<p>We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.</p>
<p>These things are somewhat different.</p>
<p>At present, both the research warehousers and the web scalers predominantly go for ETL.</p>
<p>This is fine by us as we definitely are in the large RDF store race.</p>
<p>Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x16c91438">SPARQL</a>, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.</p>
<p>To cite an example, the Ordnance Survey of the UK has a GIS system running on <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x17ee37c8">Oracle</a> with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. &quot;Such a big server farm that we&#39;d have to put it on our map,&quot; as Jenny Harding put it at <a href="http://www.eswc2008.org/" id="link-id0x1cab6330">ESWC2008</a>. I&#39;d add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.</p>
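<p>For the record, the PC count is just the division of the two figures quoted above, spelled out here as a quick check (the variable names are only for illustration):</p>

```python
total_triples = 10**15           # 1 petatriple, the Ordnance Survey figure
triples_per_pc = 10 * 10**9      # 10 gigatriples per PC, as counted above
pcs_needed = total_triples // triples_per_pc
print(pcs_needed)  # 100000
```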
<p>So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query.</p>
<p>You can download the slide set I used for the <a href="http://virtuoso.openlinksw.com" id="link-id0xa1fb7e8">Virtuoso</a> presentation for the RDB to RDF mapping incubator group (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt" id="link-id106f9e88">PPT</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id10a8dc90">other formats</a> coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the <a href="http://dbpedia.org/resource/Information" id="link-id0x17ee38a0">information</a> architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble and non-trivial question.</p>
<p>We will show the basics of this with the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1844d718">TPC-H</a> mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings between keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/RDFAndMapped_BI.pdf" id="link-id12ea4b18">PDF</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id106e1810">other formats</a> coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing <a href="http://dbpedia.org/resource/SQL" id="link-id0x17fc8d60">SQL</a> that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an <code>IN</code> sub-query instead of the equivalent existence test.</p>
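<p>To make the <code>IN</code> versus existence test remark concrete, here is a small sketch of the two equivalent SQL formulations. It makes several assumptions: SQLite stands in for the target RDBMS only because it ships with Python, the two tables are cut-down versions of the TPC-H <code>orders</code> and <code>lineitem</code> tables, and the sample rows are invented. It shows that the two forms return the same rows. It says nothing about which one a given optimizer prefers.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_quantity INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO lineitem VALUES (1, 50), (2, 10), (3, 45), (3, 5);
""")

# Orders having at least one line item with a quantity above 40, written
# with an IN sub-query.
in_form = """
    SELECT o_orderkey FROM orders
    WHERE o_orderkey IN (SELECT l_orderkey FROM lineitem WHERE l_quantity > 40)
    ORDER BY o_orderkey
"""

# The same question written as a correlated existence test.
exists_form = """
    SELECT o_orderkey FROM orders o
    WHERE EXISTS (SELECT 1 FROM lineitem l
                  WHERE l.l_orderkey = o.o_orderkey AND l.l_quantity > 40)
    ORDER BY o_orderkey
"""

in_rows = conn.execute(in_form).fetchall()
exists_rows = conn.execute(exists_form).fetchall()
print(in_rows == exists_rows)
```

<p>A mapping layer that generates SQL is free to emit either form, so it can pick whichever the target database handles better.</p>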
<p>Follow this <a href="http://dbpedia.org/resource/Blog" id="link-id0xa9bcef8">blog</a> for more on the topic; published papers are always a limited view on the matter.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1379">
  <rss:title>ESWC 2008</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T14:02:16Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">ESWC 2008 Yrjänä Rankka and I attended ESWC2008 on behalf of OpenLink. We were invited at the last minute to give a Linked Open Data talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of SPARQL BI (PPT; other formats coming soon), our business intelligence extensions to SPARQL as well as joining between relational data mapped to RDF and native RDF data. I was also speaking at the social networks panel chaired by Harry Halpin. I have gathered a few impressions that I will share in the next few posts (1 - RDF Mapping, 2 - DARQ, 3 - voiD, 4 - Paradigmata). Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant. The voiD Graph Linked Open Data was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for LOD this time around is voiD, the Vocabulary of Interlinked Datasets, a means to describe what in fact is inside the LOD cloud, how to join it with what and so forth. Big time important if there is to be a web of federatable data sources, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of Linked Data Planet in a couple of weeks. Federating Bastian Quilitz gave a talk about his DARQ, a federated version of Jena&#39;s ARQ. Something like DARQ&#39;s optimization statistics should make their way into the SPARQL protocol as well as the voiD data set description. We really need federation but more on this in a separate post. XSPARQL Axel Polleres et al. had a paper about XSPARQL, a merge of XQuery and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. 
It is evident that the engines must be in the same process and not communicate via the SPARQL protocol for this to be practical. We could do this. We&#39;ll have to see when. Politically, using XQuery to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow. Panel: Will the Sem Web Rise to the Challenge of the Social Web? The social web panel presented the question of whether the sem web was ready for prime time with data portability. The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our semantic web, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot; So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as OAuth and OpenID. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything. I said that triples and vocabularies existed, that OpenLink had ODS (OpenLink Data Spaces, Community LinkedData) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. 
Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time. The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data. So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations! As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, information overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services. Identity and Reference For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by nominibus suis, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources. For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. 
So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application. Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an open world. For an app-specific warehouse, sameAs can be resolved at load time. There was naturally some apparent tension between the Occam camp of entity name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful. CYC and NLP The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of knowledge representation. Over the long term, I would have support of the CYC inference language inside a database query processor. This would mostly be for repurposing the huge knowledge base for helping in search type queries. If it is for transactions or financial reporting, then queries will be SQL and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see. I will also have a closer look at natural language one of these days, quite inevitably, since Zitgist (for example) is into entity disambiguation. Scale Garlic gave a talk about their Data Patrol and QDOS. 
We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive. Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and Sindice/Zitgist as triples. Conclusions There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then. For the Virtuoso roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At Linked Data Planet, we will make more announcements.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">ESWC 2008</div>
<p>Yrjänä Rankka and I attended <a href="http://www.eswc2008.org/" id="link-id10b7a038">ESWC2008</a> on behalf of OpenLink.</p>
<p>We were invited at the last minute to give a <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id105df758">Linked Open Data</a> talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of <a href="http://dbpedia.org/resource/SPARQL" id="link-id12eacca0">SPARQL</a> BI (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/ESWC2008%20SPARQL%20BI%20OpenLink.ppt" id="link-id10b43e58">PPT</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id1116d8f0">other formats coming soon</a>), our business intelligence extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x16c9bfc8">SPARQL</a> as well as joining between relational <a href="http://dbpedia.org/resource/Data" id="link-id10badc40">data</a> mapped to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id108edaf8">RDF</a> and native <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x181a5ed8">RDF</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x17e69910">data</a>. I was also speaking at the social networks panel chaired by Harry Halpin.</p>
<p>I have gathered a few impressions that I will share in the next few posts (<a href="http://www.openlinksw.com/weblog/oerling/?id=1375" id="link-id107298e0">1 - RDF Mapping</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id10b3a530">2 - DARQ</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id107290e0">3 - voiD</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1378" id="link-id1071a950">4 - Paradigmata</a>). <i>Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.</i>
</p>
<h2>The voiD Graph</h2>
<p>
  <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1a87f110">Linked Open Data</a> was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id108f3c48">LOD</a> this time around is <a href="http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08" id="link-id10df9830">voiD, the Vocabulary of Interlinked Datasets</a>, a means to describe what in fact is inside the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x1a089980">LOD</a> cloud, how to join it with what and so forth. Big time important if there is to be a <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-iddf74578">web of federatable data sources</a>, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of <a href="http://www.linkeddataplanet.com/" id="link-id10dd0848">Linked Data Planet</a> in a couple of weeks.</p>
<h2>Federating</h2>
<p>Bastian Quilitz gave a talk about his <a href="http://darq.sourceforge.net/" id="link-id108746e8">DARQ</a>, a federated version of Jena&#39;s ARQ.</p>
<p>Something like <a href="http://darq.sourceforge.net/" id="link-id0x1a2d9860">DARQ</a>&#39;s optimization statistics should make their way into the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id10992348">SPARQL protocol</a> as well as the voiD data set description.</p>
<p>We really need federation but more on this in <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id1059d688">a separate post</a>.</p>
<h2>
<a href="http://xsparql.deri.ie/" id="link-id10314308">XSPARQL</a>
</h2>
<p>Axel Polleres et al had a paper about <a href="http://xsparql.deri.ie/" id="link-id0x1ad77490">XSPARQL</a>, a merge of <a href="http://dbpedia.org/resource/XQuery" id="link-id10b98e90">XQuery</a> and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x17e75190">SPARQL protocol</a> for this to be practical. We could do this. We&#39;ll have to see when.</p>
<p>Politically, using <a href="http://dbpedia.org/resource/XQuery" id="link-id0x18a9bf10">XQuery</a> to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.</p>
<h2>Panel: Will the Sem Web Rise to the Challenge of the Social Web?</h2>
<p>The social web panel presented the question of whether the sem web was ready for prime time with data portability.</p>
<p>The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id122f4da0">semantic web</a>, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot;</p>
<p>So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as <a href="http://dbpedia.org/page/OAuth" id="link-idf300bc0">OAuth</a> and <a href="http://dbpedia.org/page/OpenID" id="link-id10ce7a40">OpenID</a>. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3-year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is an impedance mismatch between specialization and interfacing with everything.</p>
<p>I said that triples and vocabularies existed, that OpenLink had <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id1210dbf8">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id11076be8">OpenLink Data Spaces</a>, <a href="http://community.linkeddata.org/" id="link-id10d46710">Community LinkedData</a>) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. Count 16 GB of RAM and a few cores per Gtriple so that one is not waiting for disk all the time.</p>
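<p>The sizing above can be checked with a little arithmetic. This is only a back-of-envelope sketch using the figures quoted in the paragraph; the variable names are made up for illustration, and nothing here accounts for replication or disk.</p>

```python
# Back-of-envelope sizing, using only the figures quoted in the post.
GTRIPLES = 100                 # target size: 100 gigatriples
RAM_GB_PER_GTRIPLE = 16        # 16 GB of RAM per Gtriple
HARDWARE_BUDGET_USD = 400_000  # about $400K of PCs, without replication

total_ram_gb = GTRIPLES * RAM_GB_PER_GTRIPLE
usd_per_gtriple = HARDWARE_BUDGET_USD / GTRIPLES

print(total_ram_gb, "GB of RAM in total")
print(usd_per_gtriple, "USD of hardware per Gtriple")
```

<p>So the estimate amounts to roughly 1.6 TB of aggregate RAM, or about $4,000 of hardware for each Gtriple kept mostly in memory.</p>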
<p>The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.</p>
<p>So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!</p>
<p>As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, <a href="http://dbpedia.org/resource/Information" id="link-id108b7688">information</a> overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.</p>
<h2>Identity and Reference</h2>
<p>For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by <i>nominibus suis</i>, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.</p>
<p>For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.</p>
<p>Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id10c4dfd0">open world</a>. For an app-specific warehouse, sameAs can be resolved at load time.</p>
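<p>A minimal sketch of what query-time sameAs could look like, assuming each application passes in the sameAs pairs it is willing to trust. The function names, the tuple-based triple store and the ex: identifiers are all hypothetical; this is not how Virtuoso or any other engine implements it.</p>

```python
def same_as_classes(pairs):
    """Union-find over the sameAs pairs one application chooses to trust."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return find


def query(triples, subject, predicate, trusted_same_as=()):
    """Return objects for (subject, predicate), expanding sameAs at query time."""
    find = same_as_classes(trusted_same_as)
    target = find(subject)
    return sorted(o for s, p, o in triples if p == predicate and find(s) == target)


# Stored data is never rewritten; only the query decides what counts as the same.
triples = [
    ("ex:table_salt", "ex:usedIn", "ex:bread"),
    ("ex:sea_salt", "ex:usedIn", "ex:brine"),
]
cooking_same_as = [("ex:sea_salt", "ex:table_salt")]

print(query(triples, "ex:sea_salt", "ex:usedIn"))                   # chemistry view
print(query(triples, "ex:sea_salt", "ex:usedIn", cooking_same_as))  # cooking view
```

<p>Because nothing is materialized, dropping a spam or malicious sameAs is a matter of leaving it out of the trusted set for the next query.</p>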
<p>There was naturally some apparent tension between the Occam camp of <a href="http://dbpedia.org/resource/Entity" id="link-id105fd240">entity</a> name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.</p>
<h2>CYC and NLP</h2>
<p>The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of <a href="http://dbpedia.org/resource/Knowledge" id="link-id10568158">knowledge</a> representation. Over the long term, I would like to see support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1acff9d0">knowledge</a> base for helping in search type queries. If it is for transactions or financial reporting, then queries will be <a href="http://dbpedia.org/resource/SQL" id="link-id130a0a80">SQL</a> and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see.</p>
<p>I will also have a closer look at natural language one of these days, quite inevitably, since <a href="http://zitgist.com/about/" id="link-id10795828">Zitgist</a> (for example) is into <a href="http://dbpedia.org/resource/Entity" id="link-id0x18a12918">entity</a> disambiguation.</p>
<h2>Scale</h2>
<p>Garlic gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.</p>
<p>Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and <a href="http://sindice.org/" id="link-id0x17f18a38">Sindice</a>/<a href="http://zitgist.com/about/" id="link-id0x184e9e90">Zitgist</a> as triples.</p>
<h2>Conclusions</h2>
<p>There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.</p>
<p>For the <a href="http://virtuoso.openlinksw.com" id="link-id1026ac28">Virtuoso</a> roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At <a href="http://www.linkeddataplanet.com/" id="link-id0x17ff5c20">Linked Data Planet</a>, we will make more announcements.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1376">
  <rss:title>The DARQ Matter of Federation</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T13:57:30Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space. For the data web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in cyberspace. To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out. DARQ is Bastian Quilitz&#39;s federated extension of the Jena ARQ SPARQL processor. It has existed for a while and was also presented at ESWC2008. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly. Bastian had split DBpedia among five Virtuoso servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though. Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads. Since we are convinced of the cause, let&#39;s talk about the means. For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one HTTP/1.1 message. 
So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server. Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over JDBC so Bastian can try this if interested. These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day. When federating SQL, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist: If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus {?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h } does not necessarily denote a colocated join if many servers serve items of the vocabulary. However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names. 
For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these. For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually. We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see. Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso Sponger. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage. The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is a separate post about this.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &quot;dark matter&quot; spread in interstellar and intergalactic space.</p>
<p>For the <a href="http://dbpedia.org/resource/Data" id="link-id0x19bbd830">data</a> web, it will be held together by federation, also an invisible factor. As in Minkowski space, so in <a href="http://dbpedia.org/resource/Cyberspace" id="link-id0x19af2488">cyberspace</a>.</p>
<p>To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.</p>
<p>
<a href="http://darq.sourceforge.net/" id="link-id0x19b7a9c8">DARQ</a> is Bastian Quilitz&#39;s federated extension of the <a href="http://jena.sourceforge.net/" id="link-id0x19ce3da0">Jena</a> <a href="http://jena.sourceforge.net/ARQ/" id="link-id0xa569a258">ARQ</a> <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1a8d2270">SPARQL</a> processor. It has existed for a while and was also presented at <a href="http://www.eswc2008.org/" id="link-id0x1aad1d00">ESWC2008</a>. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.</p>
<p>Bastian had split <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1a8ac770">DBpedia</a> among five <a href="http://virtuoso.openlinksw.com" id="link-id0x19601d30">Virtuoso</a> servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.</p>
<p>Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.</p>
<p>Since we are convinced of the cause, let&#39;s talk about the means.</p>
<p>For DARQ as it now stands, there&#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x19b94818">HTTP</a>/1.1 message. So, if the query is &quot;get me my friends living in cities of over a million people,&quot; there will be the fragment &quot;get city where x lives&quot; and later &quot;ask if population of x greater than 1000000&quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server.</p>
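<p>To get a feel for the message counts involved, here is a small sketch. It only counts round trips and does no networking; the helper names and the choice of five servers (borrowed from the DBpedia setup mentioned above) are assumptions for illustration.</p>

```python
def batches(bindings, size):
    """Split the bindings of the outer join step into fixed-size batches."""
    return [bindings[i:i + size] for i in range(0, len(bindings), size)]


def round_trips(n_bindings, n_servers, batch_size):
    """Messages sent when every eligible server receives each batch once."""
    per_server = -(-n_bindings // batch_size)  # ceiling division
    return per_server * n_servers


friends = ["friend%d" % i for i in range(100)]

print(round_trips(len(friends), n_servers=5, batch_size=1))    # one request each
print(round_trips(len(friends), n_servers=5, batch_size=100))  # one batch per server
print(len(batches(friends, 100)))
```

<p>With 100 friends and five eligible servers, the unbatched plan sends 500 messages and the batched one sends 5, which is where most of the latency savings would come from.</p>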
<p>Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x17822258">JDBC</a> so Bastian can try this if interested.</p>
<p>These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.</p>
<p>When federating <a href="http://dbpedia.org/resource/SQL" id="link-id0x1a651628">SQL</a>, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:</p>
<p>If a foaf:Person is found on a given server, this does not mean that the Person&#39;s geek code or email hash will be on the same server. Thus <code>{?p name &quot;Johnny&quot; . ?p geekCode ?g . ?p emailHash ?h }</code> does not necessarily denote a colocated join if many servers serve items of the vocabulary.</p>
<p>However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names.</p>
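<p>The colocated-first strategy with a fallback can be sketched as follows. This is a toy model under stated assumptions: each server is a plain dictionary from subject to single-valued properties, and the function names and sample data are hypothetical. It is not DARQ or Virtuoso code.</p>

```python
def match_colocated(server, name):
    """Evaluate the whole three-pattern fragment on one server."""
    return [
        (p, props["geekCode"], props["emailHash"])
        for p, props in server.items()
        if props.get("name") == name and "geekCode" in props and "emailHash" in props
    ]


def match_federated(servers, name):
    """Fallback: evaluate each join step against every server separately."""
    people = {p for s in servers for p, props in s.items() if props.get("name") == name}
    results = []
    for p in sorted(people):
        codes = [s[p]["geekCode"] for s in servers if "geekCode" in s.get(p, {})]
        hashes = [s[p]["emailHash"] for s in servers if "emailHash" in s.get(p, {})]
        results.extend((p, g, h) for g in codes for h in hashes)
    return results


def find_person(servers, name):
    """Try the cheap colocated plan first; partition the join only if it finds nothing."""
    hits = [row for s in servers for row in match_colocated(s, name)]
    return hits if hits else match_federated(servers, name)


# The unlikely case: name and geek code on one server, email hash on another.
server_a = {"ex:johnny": {"name": "Johnny", "geekCode": "GCS d-"}}
server_b = {"ex:johnny": {"emailHash": "abc123"}}
print(find_person([server_a, server_b], "Johnny"))
```

<p>The first pass costs one message per server; the per-step fallback is only paid when the colocation assumption yields nothing.</p>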
<p>For Virtuoso, we find that a federated query&#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.</p>
<p>For description, we would take DARQ&#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.</p>
<p>We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&#39;ll see.</p>
<p>Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1dddce48">Sponger</a>. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.</p>
<p>The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id1109a4c8">a separate post</a> about this.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1375">
  <rss:title>Aspects of RDF to RDF Mapping</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T13:52:20Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The W3C has recently launched an incubator group about mapping relational data to RDF. From participating in the group for the few initial sessions, I get the following impressions. There is a segment of users, for example from the biomedical community, who do heavy duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest. There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as Triplify and semantic web search such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all. (Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.) Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web. As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume. Where do we stand? We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing. We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist. These things are somewhat different. 
At present, both the research warehousers and the web scalers predominantly go for ETL. This is fine by us as we definitely are in the large RDF store race. Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in SPARQL, once the obviously necessary extensions of sub-query, expressions and aggregation are in place. To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. &quot;Such a big server farm that we&#39;d have to put it on our map,&quot; as Jenny Harding put it at ESWC2008. I&#39;d add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets. So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query. You can download the slide set I used for the Virtuoso presentation for the RDB to RDF mapping incubator group (PPT; other formats coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the information architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. 
Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble but non-trivial question. We will show the basics of this with the TPC-H mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings between keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (PDF; other formats coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing SQL that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an IN sub-query instead of the equivalent existence test. Follow this blog for more on the topic; published papers are always a limited view on the matter.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The W3C has recently launched an <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-idd763f48">incubator group about mapping relational data to RDF</a>.</p>
<p>From participating in the group for the few initial sessions, I get the following impressions.</p>
<p>There is a segment of users, for example from the biomedical community, who do heavy duty <a href="http://dbpedia.org/resource/Data" id="link-id0x1b388bf0">data</a> integration and look to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1a24b198">RDF</a> for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest.</p>
<p>There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as <a href="http://triplify.org/" id="link-id0x16cb5c48">Triplify</a> and <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x1adcd2b8">semantic web</a> search such as <a href="http://sindice.org/" id="link-id0x1a462ee0">Sindice</a>. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.</p>
<p>(Some cleanup is taking place with <a href="http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/" id="link-id0x17faa940">Yago</a> and <a href="http://zitgist.com/about/" id="link-id0x1acd23f0">Zitgist</a>, but this is a matter for a different post.)</p>
<p>Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.</p>
<p>As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.</p>
<p>Where do we stand?</p>
<p>We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x17f28d60">RDBMS</a>(s) on demand. Projects in this direction are ongoing.</p>
<p>We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.</p>
<p>These things are somewhat different.</p>
<p>At present, both the research warehousers and the web scalers predominantly go for ETL.</p>
<p>This is fine by us as we definitely are in the large RDF store race.</p>
<p>Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1a2c81c8">SPARQL</a>, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.</p>
<p>To cite an example, the Ordnance Survey of the UK has a GIS system running on <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x18a82010">Oracle</a> with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. &quot;Such a big server farm that we&#39;d have to put it on our map,&quot; as Jenny Harding put it at <a href="http://www.eswc2008.org/" id="link-id0x16533418">ESWC2008</a>. I&#39;d add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.</p>
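<p>The machine count follows directly from the two figures quoted above; a quick check, with variable names chosen here for illustration:</p>

```python
# Figures as quoted: 1 petatriple in total, 10 gigatriples hosted per PC.
TOTAL_TRIPLES = 10**15
TRIPLES_PER_PC = 10 * 10**9

pcs_needed = TOTAL_TRIPLES // TRIPLES_PER_PC
print(pcs_needed)  # 100000
```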
<p>So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query.</p>
<p>You can download the slide set I used for the <a href="http://virtuoso.openlinksw.com" id="link-id0x16c57ed0">Virtuoso</a> presentation for the RDB to RDF mapping incubator group (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt" id="link-id106f9e88">PPT</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id10a8dc90">other formats</a> coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the <a href="http://dbpedia.org/resource/Information" id="link-id0x1b132910">information</a> architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble but non-trivial question.</p>
<p>We will show the basics of this with the <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x17fc7b58">TPC-H</a> mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings between keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/RDFAndMapped_BI.pdf" id="link-id12ea4b18">PDF</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id106e1810">other formats</a> coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing <a href="http://dbpedia.org/resource/SQL" id="link-id0x18439b70">SQL</a> that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an <code>IN</code> sub-query instead of the equivalent existence test.</p>
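<p>For readers unfamiliar with the two SQL shapes mentioned above, here is a small illustration using SQLite from the Python standard library as a stand-in. The tables borrow TPC-H column names but the rows are made up, and the sketch only shows that the two forms return the same answer. It says nothing about how Oracle or any other optimizer treats them.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Customers with at least one order, written as an IN sub-query.
in_form = """
    SELECT c_name FROM customer
    WHERE c_custkey IN (SELECT o_custkey FROM orders)
    ORDER BY c_name
"""

# The equivalent existence test.
exists_form = """
    SELECT c_name FROM customer c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.o_custkey = c.c_custkey)
    ORDER BY c_name
"""

in_rows = conn.execute(in_form).fetchall()
exists_rows = conn.execute(exists_form).fetchall()
print(in_rows == exists_rows, in_rows)
```

<p>A mapping layer that generates SQL is free to pick either form, so it can choose the one the target RDBMS is known to plan better.</p>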
<p>Follow this <a href="http://dbpedia.org/resource/Blog" id="link-id0x16c29ea0">blog</a> for more on the topic; published papers are always a limited view on the matter.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1374">
  <rss:title>ESWC 2008</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-06-09T13:49:15Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Yrjänä Rankka and I attended ESWC2008 on behalf of OpenLink. We were invited at the last minute to give a Linked Open Data talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of SPARQL BI (PPT; other formats coming soon), our business intelligence extensions to SPARQL as well as joining between relational data mapped to RDF and native RDF data. I also spoke at the social networks panel chaired by Harry Halpin. I have gathered a few impressions that I will share in the next few posts (1 - RDF Mapping, 2 - DARQ, 3 - voiD, 4 - Paradigmata). Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant. The voiD Graph Linked Open Data was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for LOD this time around is voiD, the Vocabulary of Interlinked Datasets, a means to describe what in fact is inside the LOD cloud, how to join it with what and so forth. Big time important if there is to be a web of federatable data sources, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of Linked Data Planet in a couple of weeks. Federating Bastian Quilitz gave a talk about his DARQ, a federated version of Jena&#39;s ARQ. Something like DARQ&#39;s optimization statistics should make their way into the SPARQL protocol as well as the voiD data set description. We really need federation but more on this in a separate post. XSPARQL Axel Polleres et al. had a paper about XSPARQL, a merge of XQuery and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. 
It is evident that the engines must be in the same process and not communicate via the SPARQL protocol for this to be practical. We could do this. We&#39;ll have to see when. Politically, using XQuery to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow. Panel: Will the Sem Web Rise to the Challenge of the Social Web? The social web panel presented the question of whether the sem web was ready for prime time with data portability. The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our semantic web, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot; So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as OAuth and OpenID. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything. I said that triples and vocabularies existed, that OpenLink had ODS (OpenLink Data Spaces, Community LinkedData) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. 
Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time. The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data. So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations! As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, information overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services. Identity and Reference For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by nominibus suis, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources. For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. 
So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application. Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an open world. For an app-specific warehouse, sameAs can be resolved at load time. There was naturally some apparent tension between the Occam camp of entity name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful. CYC and NLP The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of knowledge representation. Over the long term, I would have support of the CYC inference language inside a database query processor. This would mostly be for repurposing the huge knowledge base for helping in search type queries. If it is for transactions or financial reporting, then queries will be SQL and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see. I will also have a closer look at natural language one of these days, quite inevitably, since Zitgist (for example) is into entity disambiguation. Scale Garlic gave a talk about their Data Patrol and QDOS. 
We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive. Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and Sindice/Zitgist as triples. Conclusions There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then. For the Virtuoso roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At Linked Data Planet, we will make more announcements.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Yrjänä Rankka and I attended <a href="http://www.eswc2008.org/" id="link-id10b7a038">ESWC2008</a> on behalf of OpenLink.</p>
<p>We were invited at the last minute to give a <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id105df758">Linked Open Data</a> talk at Paolo Bouquet&#39;s Identity and Reference workshop. We also had a demo of <a href="http://dbpedia.org/resource/SPARQL" id="link-id12eacca0">SPARQL</a> BI (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/ESWC2008%20SPARQL%20BI%20OpenLink.ppt" id="link-id10b43e58">PPT</a>; <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations" id="link-id1116d8f0">other formats coming soon</a>), our business intelligence extensions to <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1843a368">SPARQL</a> as well as joining between relational <a href="http://dbpedia.org/resource/Data" id="link-id10badc40">data</a> mapped to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id108edaf8">RDF</a> and native <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1843a3b0">RDF</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x1843a3c8">data</a>. I also spoke at the social networks panel chaired by Harry Halpin.</p>
<p>I have gathered a few impressions that I will share in the next few posts (<a href="http://www.openlinksw.com/weblog/oerling/?id=1375" id="link-id107298e0">1 - RDF Mapping</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id10b3a530">2 - DARQ</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-id107290e0">3 - voiD</a>, <a href="http://www.openlinksw.com/weblog/oerling/?id=1378" id="link-id1071a950">4 - Paradigmata</a>). <i>Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.</i>
</p>
<h2>The voiD Graph</h2>
<p>
<a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x16c781e0">Linked Open Data</a> was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id108f3c48">LOD</a> this time around is <a href="http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08" id="link-id10df9830">voiD, the Vocabulary of Interlinked Datasets</a>, a means to describe what in fact is inside the <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x16c78228">LOD</a> cloud, how to join it with what and so forth. Big time important if there is to be a <a href="http://www.openlinksw.com/weblog/oerling/?id=1377" id="link-iddf74578">web of federatable data sources</a>, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of <a href="http://www.linkeddataplanet.com/" id="link-id10dd0848">Linked Data Planet</a> in a couple of weeks.</p>
<h2>Federating</h2>
<p>Bastian Quilitz gave a talk about his <a href="http://darq.sourceforge.net/" id="link-id108746e8">DARQ</a>, a federated version of Jena&#39;s ARQ.</p>
<p>Something like <a href="http://darq.sourceforge.net/" id="link-id0x16c782e8">DARQ</a>&#39;s optimization statistics should make their way into the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id10992348">SPARQL protocol</a> as well as the voiD data set description.</p>
<p>We really need federation but more on this in <a href="http://www.openlinksw.com/weblog/oerling/?id=1376" id="link-id1059d688">a separate post</a>.</p>
<h2>
<a href="http://xsparql.deri.ie/" id="link-id10314308">XSPARQL</a>
</h2>
<p>Axel Polleres et al. had a paper about <a href="http://xsparql.deri.ie/" id="link-id0x1a2d8458">XSPARQL</a>, a merge of <a href="http://dbpedia.org/resource/XQuery" id="link-id10b98e90">XQuery</a> and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1d99c1d0">SPARQL protocol</a> for this to be practical. We could do this. We&#39;ll have to see when.</p>
<p>Politically, using <a href="http://dbpedia.org/resource/XQuery" id="link-id0x1acae1f0">XQuery</a> to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.</p>
<h2>Panel: Will the Sem Web Rise to the Challenge of the Social Web?</h2>
<p>The social web panel presented the question of whether the sem web was ready for prime time with data portability.</p>
<p>The main thrust was expressed in Harry Halpin&#39;s rousing closing words: &quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id122f4da0">semantic web</a>, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&quot;</p>
<p>So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as <a href="http://dbpedia.org/page/OAuth" id="link-idf300bc0">OAuth</a> and <a href="http://dbpedia.org/page/OpenID" id="link-id10ce7a40">OpenID</a>. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &quot;outdated pace&quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.</p>
<p>I said that triples and vocabularies existed, that OpenLink had <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id1210dbf8">ODS</a> (<a href="http://dbpedia.org/resource/OpenLink_Data_Spaces" id="link-id11076be8">OpenLink Data Spaces</a>, <a href="http://community.linkeddata.org/" id="link-id10d46710">Community LinkedData</a>) for managing one&#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&#39;s prices, without replication. Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time.</p>
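<p>The sizing rule above can be worked through as plain arithmetic. The triple count, the RAM per Gtriple and the budget are the figures quoted in the paragraph; the RAM per PC is an assumption added only to make the division concrete, and <code>cluster_estimate</code> is a hypothetical helper name.</p>

```python
import math

# Figures quoted in the paragraph above.
TOTAL_GTRIPLES = 100
RAM_GB_PER_GTRIPLE = 16
BUDGET_USD = 400_000  # hardware only, without replication

# Assumption for illustration only: RAM per commodity PC is not given in the post.
ASSUMED_RAM_GB_PER_PC = 16


def cluster_estimate(gtriples, ram_gb_per_gtriple, ram_gb_per_pc, budget_usd):
    """Derive total RAM, machine count and per-machine budget from the rule of thumb."""
    total_ram_gb = gtriples * ram_gb_per_gtriple
    pcs = math.ceil(total_ram_gb / ram_gb_per_pc)
    return {
        "total_ram_gb": total_ram_gb,
        "pcs": pcs,
        "usd_per_pc": budget_usd / pcs,
    }


estimate = cluster_estimate(
    TOTAL_GTRIPLES, RAM_GB_PER_GTRIPLE, ASSUMED_RAM_GB_PER_PC, BUDGET_USD
)
print(estimate)
```

<p>Under the assumed 16 GB per machine this comes to 1600 GB of RAM across 100 PCs, or about $4,000 per PC within the quoted budget. A different machine size changes the count but not the total RAM.</p>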
<p>The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.</p>
<p>So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!</p>
<p>As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, <a href="http://dbpedia.org/resource/Information" id="link-id108b7688">information</a> overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.</p>
<h2>Identity and Reference</h2>
<p>For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by <i>nominibus suis</i>, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.</p>
<p>For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.</p>
<p>Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id10c4dfd0">open world</a>. For an app-specific warehouse, sameAs can be resolved at load time.</p>
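<p>A minimal sketch of what "in effect for query answering" could look like: the stored triples are never rewritten, and each application supplies the sameAs pairs it trusts when it asks a question. This is an illustration using a small union-find, not a description of how Virtuoso implements it; the class, the function and the <code>ex:</code> identifiers are all hypothetical.</p>

```python
class SameAsResolver:
    """Union-find over the sameAs assertions that one application chooses to trust."""

    def __init__(self, trusted_pairs):
        self.parent = {}
        for a, b in trusted_pairs:
            root_a, root_b = self.find(a), self.find(b)
            if root_a != root_b:
                self.parent[root_a] = root_b

    def find(self, node):
        """Return the representative of the equivalence class containing node."""
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving: point node at its grandparent, then step up.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node


def objects_of(triples, subject, predicate, resolver):
    """Answer (subject, predicate, ?o), treating trusted sameAs aliases as one subject."""
    wanted = resolver.find(subject)
    return sorted(
        {o for s, p, o in triples if p == predicate and resolver.find(s) == wanted}
    )


# Stored data, loaded once and left untouched.
TRIPLES = [
    ("ex:NaCl", "ex:usedIn", "ex:Brine"),
    ("ex:TableSalt", "ex:usedIn", "ex:Soup"),
]

# A cooking application accepts the equivalence; a chemistry application does not.
cooking = SameAsResolver([("ex:TableSalt", "ex:NaCl")])
chemistry = SameAsResolver([])

print(objects_of(TRIPLES, "ex:TableSalt", "ex:usedIn", cooking))
print(objects_of(TRIPLES, "ex:TableSalt", "ex:usedIn", chemistry))
```

<p>Because nothing is materialized, dropping a spam or malicious sameAs only means leaving it out of the trusted set for the next query. An app-specific warehouse could instead apply one resolver once at load time.</p>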
<p>There was naturally some apparent tension between the Occam camp of <a href="http://dbpedia.org/resource/Entity" id="link-id105fd240">entity</a> name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.</p>
<h2>CYC and NLP</h2>
<p>The semantic search workshop was interesting, especially CYC&#39;s presentation. CYC is, as it were, the grand old man of <a href="http://dbpedia.org/resource/Knowledge" id="link-id10568158">knowledge</a> representation. Over the long term, I would have support of the CYC inference language inside a database query processor. This would mostly be for repurposing the huge <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x17f7dd40">knowledge</a> base for helping in search type queries. If it is for transactions or financial reporting, then queries will be <a href="http://dbpedia.org/resource/SQL" id="link-id130a0a80">SQL</a> and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&#39;ll see.</p>
<p>I will also have a closer look at natural language one of these days, quite inevitably, since <a href="http://zitgist.com/about/" id="link-id10795828">Zitgist</a> (for example) is into <a href="http://dbpedia.org/resource/Entity" id="link-id0x1a2c8bd0">entity</a> disambiguation.</p>
<h2>Scale</h2>
<p>Garlic gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.</p>
<p>Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and <a href="http://sindice.org/" id="link-id0x1b383720">Sindice</a>/<a href="http://zitgist.com/about/" id="link-id0x1b383738">Zitgist</a> as triples.</p>
<h2>Conclusions</h2>
<p>There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.</p>
<p>For the <a href="http://virtuoso.openlinksw.com" id="link-id1026ac28">Virtuoso</a> roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At <a href="http://www.linkeddataplanet.com/" id="link-id0x1a2c7eb0">Linked Data Planet</a>, we will make more announcements.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-05-02#1357">
  <rss:title>Comments about recent Semantic Gang Podcast</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-05-02T21:44:31Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">After listening to the latest Semantic Web Gang podcast, I found myself agreeing with some of the points made by Alex Iskold, specifically: -- Linked Data does not imply making all your data public -- Linked Data principles benefit Intranet and Extranet style data integration (trumps alternative distributed database integration approaches any day) -- Business exploitation of Linked Data on the Web will certainly be driven by the correlation of opportunity costs (which is more than likely what Alex meant by &quot;use cases&quot;) associated with the lack of URIs originating from the domain of a given business (Tom Heath also effectively alluded to this via his BBC and URI land grab anecdotes; the same applies to Georgi&#39;s examples) -- History is a great tutor; answers to many of today&#39;s problems always lie somewhere in plain sight of the past. Of course, I also believe that Linked Data serves Web Data Integration across the Internet very well too, and that it will be beneficial to businesses in a big way. No individual or organization is an island; I think the Internet and Web have done a good job of demonstrating that thus far :-) We&#39;re all data nodes in a Giant Global Graph. Daniel Lewis did shed light on the read-write aspects of the Linked Data Web, which is actually very close to the callout for a Wikipedia for Data. TimBL has been working on this via Tabulator (see Tabulator Editing Screencast), Benjamin Nowack also added similar functionality to ARC, and of course we support the same SPARQL UPDATE into an RDF information resource via the RDF Sink feature of our WebDAV and ODS-Briefcase implementations.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>After listening to the <a href="http://semanticgang.talis.com/2008/05/02/april-2008-the-semantic-web-gang-discuss-a-wikipedia-for-data/" id="link-id1089e218">latest Semantic Web Gang podcast</a>, I found myself agreeing with some of the points made by <a href="http://www.linkedin.com/in/iskold" id="link-id10b91e58">Alex Iskold</a>, specifically:

</p>
<ul>-- <a href="http://dbpedia.org/resource/Linked_Data" id="link-id106e24e0">Linked Data</a> does not imply making all your <a href="http://dbpedia.org/resource/Data" id="link-id17ab3d48">data</a> public</ul>
<ul>-- <a href="http://dbpedia.org/resource/Linked_Data" id="link-id11fdcef0">Linked Data</a> principles benefit <a href="http://dbpedia.org/resource/Intranet" id="link-id109756e8">Intranet</a> and <a href="http://dbpedia.org/resource/Extranet" id="link-id1099cfd8">Extranet</a> style <a href="http://dbpedia.org/resource/Data" id="link-id10cd25b0">data</a> integration (trumps alternative <a href="http://dbpedia.org/resource/federated_database_system" id="link-id14f29940">distributed database</a> integration approaches any day)</ul>
<ul>-- Business exploitation of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0xca51940">Linked Data</a> on the <a href="http://dbpedia.org/resource/World_Wide_Web">Web</a> will certainly be driven by the correlation of opportunity costs (which is more than likely what Alex meant by &quot;use cases&quot;) associated with the lack of URIs originating from the domain of a given business (Tom Heath also effectively alluded to this via his <a href="http://dbpedia.org/resource/BBC" id="link-id16f33348">BBC</a> and <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id10decf38">URI</a> land grab anecdotes; the same applies to Georgi&#39;s examples)</ul>
<ul>-- History is a great tutor; answers to many of today&#39;s problems always lie somewhere in plain sight of the past.</ul>

<p>Of course, I also believe that <a href="http://dbpedia.org/resource/Linked_Data">Linked Data</a> serves Web <a href="http://dbpedia.org/resource/Data" id="link-id0x1afebd58">Data</a> Integration across the <a href="http://dbpedia.org/resource/Internet" id="link-id10aa5668">Internet</a> very well too, and that it will be beneficial to businesses in a big way. No individual or organization is an island; I think the <a href="http://dbpedia.org/resource/Internet" id="link-id0xb25fbd0">Internet</a> and Web have done a good job of demonstrating that thus far :-) We&#39;re all <a href="http://dbpedia.org/resource/Data">data</a> nodes in a <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id5d8a3a8">Giant Global Graph</a>.</p>

<p>
<a href="http://myopenlink.net/dataspace/person/danieljohnlewis#this" id="link-id17cac8a0">Daniel Lewis</a> did shed light on the read-write aspects of the Linked Data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id10be8590">Web</a>, which is actually very close to the callout for a Wikipedia for Data. <a href="http://www.w3.org/People/Berners-Lee/card#i" id="link-id10a810c0">TimBL</a> has been working on this via <a href="http://dig.csail.mit.edu/2005/ajar/release/tabulator/0.8/tab.html" id="link-id184b7108">Tabulator</a> (see <a href="http://dig.csail.mit.edu/2007/tab/tutorial/editing.mov" id="link-id1416f1e8">Tabulator Editing Screencast</a>), <a href="http://bnode.org/about" id="link-id17e33750">Benjamin Nowack</a> also added <a href="http://arc.semsol.org/download/plugins/data_wiki" id="link-id1688cc40">similar functionality to ARC</a>, and of course we support the same <a href="http://dbpedia.org/resource/SPARQL" id="link-id10bff7c8">SPARQL</a> UPDATE into an <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id168ace08">RDF</a> <a href="http://dbpedia.org/resource/Information" id="link-id10641878">information</a> resource via the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xddb5240">RDF</a> Sink feature of our WebDAV and <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/OdsBriefcase" id="link-id0x11199310">ODS</a>-Briefcase implementations.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2008-03-27#1330">
  <rss:title>The Cost of doing the Right Thing</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-03-27T18:41:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">One of the biggest impediments to the adoption of technology is the cost burden typically associated with doing the right thing. For instance, requirements for making the Linked Data Web (GGG) buzz would include the following (paraphrasing TimBL&#39;s original Linked Data meme): -- Identify the things you observe, or stumble upon, using URIs (aka Entity IDs) -- Construct URIs using HTTP so that the Web provides a channel for referencing things elsewhere (remote object referencing) -- Expose things in your Data Space(s) that are potentially useful to other Web users via URIs -- Link to other Web accessible things using their URIs. The list is nice, but actual execution can be challenging. For instance, when writing a blog post, or constructing a WikiWord, would you have enough disposable time to go searching for these URIs? Or would you compromise and continue to inject &quot;Literal&quot; values into the Web, leaving it to the reasoning endowed human reader to connect the dots? Anyway, OpenLink Data Spaces is now equipped with a Glossary system that allows me to manage terms, the meanings of terms, and the hyper-linking of phrases and words matching my terms. The great thing about all of this is that everything I do is scoped to my Data Space (my universe of discourse); I don&#39;t break or impede the other meanings of these terms outside my Data Space. The Glossary system can be shared with anyone I choose to share it with, and even better, it makes my upstreaming (rules based replication) style of blogging even more productive :-) Remember, on the Linked Data Web, who you know doesn&#39;t matter as much as what you are connected to, directly or indirectly. Jason Kolb covers this issue in his post: People as Data Connectors, and so does Frederick Giasson via a recent post titled: Networks are everywhere. 
For instance, this blog post (or the entire Blog) is a bona fide RDF Linked Data Source, you can use it as the Data Source of a SPARQL Query to find things that aren&#39;t even mentioned in this post, since all you are doing is beaming a query through my Data Space (a container of Linked Data Graphs). On that note, let&#39;s re-watch Jon Udell&#39;s &quot;On-Demand-Blogosphere&quot; screencast from 2006 :-)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>One of the biggest impediments to the adoption of technology is the cost burden typically associated with doing the right thing. For instance, requirements for making the <a href="http://dbpedia.org/resource/Linked_Data">Linked Data</a> <a href="http://dbpedia.org/resource/World_Wide_Web">Web</a> (<a href="http://dbpedia.org/resource/Giant_Global_Graph">GGG</a>) buzz would include the following (paraphrasing <a href="http://www.w3.org/People/Berners-Lee/card#i">TimBL</a>&#39;s original <a href="http://www.w3.org/DesignIssues/LinkedData.html">Linked Data meme</a>): </p>

<ul>-- identifying the things you observe, or stumble upon, using URIs (aka <a href="http://dbpedia.org/resource/Entity">Entity</a> IDs)</ul>

<ul>-- construct URIs using <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol">HTTP</a> so that the Web provides a channel for referencing things elsewhere (remote object referencing)</ul>

<ul>-- Expose things in your <a href="http://dbpedia.org/resource/Data">Data</a> <a href="http://en.wikipedia.org/wiki/Data_Spaces">Space</a>(s) that are potentially useful to other Web users via URIs</ul>

<ul>-- Link to other Web accessible things using their URIs.</ul>

<p>The list is nice, but actual execution can be challenging. For instance, when writing a <a href="http://dbpedia.org/resource/Blog">blog</a> post, or constructing a <a href="http://dbpedia.org/resource/WikiWord">WikiWord</a>, would you have enough disposable time to go searching for these URIs? Or would you compromise and continue to inject &quot;Literal&quot; values into the Web, leaving it to the reasoning endowed human reader to connect the dots?</p>

<p>Anyway, <a href="http://dbpedia.org/resource/OpenLink_Data_Spaces">OpenLink Data Spaces</a> is now equipped with a <a href="http://dbpedia.org/resource/Glossary">Glossary</a> system that allows me to manage terms, meaning of terms, and hyper-linking of phrases and words associated with my terms. The great thing about all of this is that everything I do is scoped to <a href="http://myopenlink.net/dataspace/kidehen">my Data Space</a> (my universe of discourse), so I don&#39;t break or impede the other meanings of these terms outside my Data Space. The Glossary system can be shared with anyone I choose to share it with, and even better, it makes my upstreaming (rules based replication) style of blogging even more productive :-) </p>

<p>Remember, on the Linked Data Web, who you know doesn&#39;t matter as much as what you are connected to, directly or indirectly. <a href="http://www.jasonkolb.com/">Jason Kolb</a> covers this issue in his post: <a href="http://www.jasonkolb.com/weblog/2008/03/users-as-data-c.html" id="link-id1586a468">People as Data Connectors</a>, and so does Frederick Giasson via a recent post titled: <a href="http://fgiasson.com/blog/index.php/2008/03/11/networks-are-everywhere/" id="link-id108b9010">Networks are everywhere</a>. For instance, this blog post (or the entire Blog) is a bona fide <a href="http://dbpedia.org/resource/Resource_Description_Framework">RDF</a> Linked Data Source, you can use it as the Data Source of a <a href="http://dbpedia.org/resource/SPARQL">SPARQL</a> Query to find things that aren&#39;t even mentioned in this post, since all you are doing is beaming a query through my Data Space (a container of Linked Data Graphs). On that note, let&#39;s re-watch <a href="http://blog.jonudell.net/">Jon Udell</a>&#39;s <a href="http://weblog.infoworld.com/udell/gems/queryingBlogs.html" id="link-id108c0908">&quot;On-Demand-Blogosphere&quot; screencast from 2006</a> :-)</p>

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-11-21#1274">
  <rss:title>RDF Benchmarking, Role, Motives, and Rationale</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-11-21T14:19:39Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Arising from the recent W3C workshop on mapping relational data to RDF, there is some discussion on starting a benchmarking oriented experimental group under the W3C. I&#39;ll here make some comments on where this might fit and how this might serve our nascent industry. To the public, basically any recipient of the semantic data web message, the benchmarking activity should communicate: The semantic data web claims to allow integrating any legacy data from wherever and allow translating this into common, mutually joinable vocabularies, and make the web into a big database capable of answering structured queries on any open data. The benchmarking activity is to prove that this is not a pipe dream that Gartner Group forecast for 2027. Instead, there exists an industry, a degree of consensus within the industry concerning what the semantic data web is for, and products that are beyond experimental and can deliver at least some of the claimed benefits of the semantic data web. To the general public, the message will be best delivered by the existence of online services that do interesting things with linked data, starting from search and going to more specialized derivative products of structured information on the web. To those intending to apply some semantic data web things themselves, the benchmark activity should give a directory of products to look at. The reason why a benchmark suite backed by some industry consortium is useful is that it adds to the end user&#39;s confidence that the use case being measured is of somewhat general relevance and not just made to demonstrate any single product&#39;s strengths. Besides this, the TPC idea of disclosing scale, throughput, price per throughput and date is fine because it makes for easy tabulation of results. The intricacies in the full disclosure are effectively masked and it is my guess that very few read the actual full disclosures. 
The inference that an evaluator draws from benchmark results is that some product figuring there consistently is somewhat serious and can be studied further. Being in the running is like a stamp of approval. The benchmarks are complex and the evaluator seldom goes to the trouble of really analyzing performance by individual query or transaction even if these are and must be given. It is a bit like Formula 1 viewers do not generally read the rules on car engine or aerodynamics, let alone understand their finer points. For credibility to be thus given to products and hence the industry, we should just have a couple of well defined and agreed upon benchmarks, just like TPC. The third public is the developer. As a DBMS developer, I am a great fan of TPC. The great benefit I derive from their work is that they give a test suite for measuring effects of code changes on performance. Also, assuming that the TPC workload mix is representative, it also allows ranking what optimizations are more important than others. Lastly, TPC gives a great way of describing results, e.g., changes resulting in x% improvement on throughput of y. In such usage, the benchmarks are pretty much never run by the rules but results obtained are still good for internal comparison. Communication about IS should allow for short, simple messages: Release XX Halves Price per Throughput. The existence of benchmarks is, if not absolutely necessary, then at least a great help for such communication. Besides, people are culturally used to all kinds of racing and sports results so this is even a familiar format. Now the TPC is also not perfect. In the high end, the measured configurations are so large that one does not see them very often in practice. It is like the techno sports of Formula 1 or America&#39;s Cup. Interesting for the curiosity value but not immediately relevant to the regular car buyer or weekend yachtsman. Further, sponsoring a by-the-book audited TPC result is not so simple. 
Not as expensive as putting out an America&#39;s Cup challenge but still some trouble and expense. So, for us to benefit by the benchmarking activity, we must find a group that can both agree and be somewhat representative. Then we must put out a simple message: This here is for integration of relational sources and this here for storage and query of RDF. Furthermore, insofar as we derive from relational or similar sources, the technology should not do less than the established alternative. Doing less would send the wrong message. Entering the running should not be overly difficult for vendors, hence we should not have too many benchmarks and the ones that there are should be representative and sufficiently varied workloads. The results should be compact and easy to state. One more reason why I like TPC&#39;s work is the fact that the benchmarks have an easy to understand, unified use case behind them. Approximately what is done in each becomes clear from a very short and succinct description even though the details can be complex. I suspect this is one side of their appeal. I would venture the guess that a single use case story is easier to sell than a composite metric of disparate tests. Also in the scientific computing world, we have use cases, like NAS for aerodynamics, so having a use case story is quite common and a factor for making a benchmark&#39;s relevance understandable. Is this all possible? To play the devil&#39;s advocate, I could say that the use cases are not as well settled as the relational ones hence formulating a generally representative benchmark is not possible. Now this is certainly not a message that this community wishes to send. Besides, there exists decades worth of history of the problems of information integration and a great deal of RDF data out there, even a compilation of dozens of industry use cases by the SWEO, so we are not exactly in the dark here. Can there be political agreement in reasonable time? 
If we look at the TPC as a precedent, judging by the rate of publication and revision, the process is not exactly quick. Now, for the TPC, it does not have to be. Judging by the frequency of published test results, hardware vendors are happy enough to have a forum to show off and do so at every turn. Now we are not at this stage of maturity yet. Composing a TPC style test spec is possible in a reasonable time for an individual but likely not for a committee. It is quite voluminous but also quite formulaic. While TPC&#39;s material is their own, I see no reason that we could not reference or link to it where applicable. Who would be motivated by such activity? How to pitch the activity to would-be participants? I don&#39;t think that just talking about what to measure and how is interesting enough. This is covered ground. Vendors want to promote themselves and end users want to have vendors compete at solving their problems. Or so it would be in a simpler world. Personally, I&#39;d like to see a benchmark with a use case story people can relate to emerge in the next few months. Now I am not necessarily holding my breath waiting for this. For purposes of ongoing development, there is the real data out there and we can for example do the social web workload mix I suggested a couple of blog posts back on that and it is good enough for us. But that is not good enough for the industry&#39;s messaging. I&#39;d say that we have to assume that people play in good faith and simply ask who wants to run and get an extra edge by being in on the design of the race track. By good faith I here mean a sincere wish to have the race take place in the first place. The sport is exciting for the players and spectators alike if there is a use case story that they can relate to and an actual tournament. So this is what we should aim for. 
Because this is so far a niche public, we should not fragment the activity too much and we should consider how understandable and relevant the benchmark activity is to likely semantic data web adopters.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Arising from the recent W3C workshop on <a href="http://www.w3.org/2007/03/RdfRDB/" id="link-id10679c70">mapping relational data to RDF</a>, there is some discussion on <a href="http://www.openlinksw.com/weblog/oerling/?id=1268" id="link-id1258dca8">starting a benchmarking oriented experimental group</a> under the W3C. I&#39;ll here make some comments on where this might fit and how this might serve our nascent industry.</p>
<p>To the public, basically any recipient of the semantic <a href="http://dbpedia.org/resource/Data" id="link-id0xa203a350">data</a> web message, the benchmarking activity should communicate:</p>
<ul>
 <li>
  <p>The semantic data web claims to</p>
<ol>
    <li> allow integrating any legacy data from wherever and allow translating this into common, mutually joinable vocabularies, and</li>
<li>make the web into a big database capable of answering structured queries on any open data.</li>
  </ol>
 </li>
<li>
  <p>The benchmarking activity is to prove that this is not a pipe dream that Gartner Group forecast for 2027. Instead, there exists </p>
<ol>
    <li>an industry, </li>
<li>a degree of consensus within the industry concerning what the semantic data web is for, and</li>
<li>products that are beyond experimental and can deliver at least some of the claimed benefits of the semantic data web.</li>
  </ol>
</li>
</ul>
<p>To the general public, the message will be best delivered by the existence of online services that do interesting things with <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1404e708">linked data</a>, starting from search and going to more specialized derivative products of structured <a href="http://dbpedia.org/resource/Information" id="link-id0xa1e12cd8">information</a> on the web.</p>
<p>To those intending to apply some semantic data web things themselves, the benchmark activity should give a directory of products to look at. The reason why a benchmark suite backed by some industry consortium is useful is that it adds to the end user&#39;s confidence that the use case being measured is of somewhat general relevance and not just made to demonstrate any single product&#39;s strengths. Besides this, the TPC idea of disclosing scale, throughput, price per throughput and date is fine because it makes for easy tabulation of results. The intricacies in the full disclosure are effectively masked and it is my guess that very few read the actual full disclosures.</p>
<p>The inference that an evaluator draws from benchmark results is that some product figuring there consistently is somewhat serious and can be studied further. Being in the running is like a stamp of approval. The benchmarks are complex and the evaluator seldom goes to the trouble of really analyzing performance by individual query or transaction even if these are and must be given. It is a bit like Formula 1 viewers do not generally read the rules on car engine or aerodynamics, let alone understand their finer points.</p>
<p>For credibility to be thus given to products and hence the industry, we should just have a couple of well defined and agreed upon benchmarks, just like TPC.</p>
<p>The third public is the developer. As a DBMS developer, I am a great fan of TPC. The great benefit I derive from their work is that they give a test suite for measuring effects of code changes on performance. Also, assuming that the TPC workload mix is representative, it also allows ranking what optimizations are more important than others. Lastly, TPC gives a great way of describing results, e.g., changes resulting in x% improvement on throughput of y. In such usage, the benchmarks are pretty much never run by the rules but results obtained are still good for internal comparison.</p>
<p>Communication about IS should allow for short, simple messages: Release XX Halves Price per Throughput.</p>
<p>The existence of benchmarks is, if not absolutely necessary, then at least a great help for such communication. Besides, people are culturally used to all kinds of racing and sports results so this is even a familiar format.</p>
<p>Now the TPC is also not perfect. In the high end, the measured configurations are so large that one does not see them very often in practice. It is like the techno sports of Formula 1 or America&#39;s Cup. Interesting for the curiosity value but not immediately relevant to the regular car buyer or weekend yachtsman. Further, sponsoring a by-the-book audited TPC result is not so simple. Not as expensive as putting out an America&#39;s Cup challenge but still some trouble and expense.</p>
<p>So, for us to benefit by the benchmarking activity, we must find a group that can both agree and be somewhat representative. Then we must put out a simple message: This here is for integration of relational sources and this here for storage and query of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xa192c590">RDF</a>.</p>
<p>Furthermore, insofar as we derive from relational or similar sources, the technology should not do less than the established alternative. Doing less would send the wrong message.</p>
<p>Entering the running should not be overly difficult for vendors, hence we should not have too many benchmarks and the ones that there are should be representative and sufficiently varied workloads. The results should be compact and easy to state. One more reason why I like TPC&#39;s work is the fact that the benchmarks have an easy to understand, unified use case behind them. Approximately what is done in each becomes clear from a very short and succinct description even though the details can be complex. I suspect this is one side of their appeal. I would venture the guess that a single use case story is easier to sell than a composite metric of disparate tests. Also in the scientific computing world, we have use cases, like NAS for aerodynamics, so having a use case story is quite common and a factor for making a benchmark&#39;s relevance understandable.</p>
<p>Is this all possible?</p>
<p>To play the devil&#39;s advocate, I could say that the use cases are not as well settled as the relational ones hence formulating a generally representative benchmark is not possible. Now this is certainly not a message that this community wishes to send. Besides, there exists decades worth of history of the problems of information integration and a great deal of RDF data out there, even a compilation of dozens of industry use cases by the SWEO, so we are not exactly in the dark here.</p>
<p>Can there be political agreement in reasonable time? If we look at the TPC as a precedent, judging by the rate of publication and revision, the process is not exactly quick. Now, for the TPC, it does not have to be. Judging by the frequency of published test results, hardware vendors are happy enough to have a forum to show off and do so at every turn.</p>
<p>Now we are not at this stage of maturity yet.</p>
<p>Composing a TPC style test spec is possible in a reasonable time for an individual but likely not for a committee. It is quite voluminous but also quite formulaic. While TPC&#39;s material is their own, I see no reason that we could not reference or link to it where applicable.</p>
<p>Who would be motivated by such activity? How to pitch the activity to would-be participants? I don&#39;t think that just talking about what to measure and how is interesting enough. This is covered ground. Vendors want to promote themselves and end users want to have vendors compete at solving their problems. Or so it would be in a simpler world.</p>
<p>Personally, I&#39;d like to see a benchmark with a use case story people can relate to emerge in the next few months. Now I am not necessarily holding my breath waiting for this. For purposes of ongoing development, there is the real data out there and we can for example do the social web workload mix I suggested a couple of <a href="http://dbpedia.org/resource/Blog" id="link-id0x1f671bf8">blog</a> posts back on that and it is good enough for us. But that is not good enough for the industry&#39;s messaging.</p>
<p>I&#39;d say that we have to assume that people play in good faith and simply ask who wants to run and get an extra edge by being in on the design of the race track. By good faith I here mean a sincere wish to have the race take place in the first place.</p>
<p>The sport is exciting for the players and spectators alike if there is a use case story that they can relate to and an actual tournament. So this is what we should aim for. Because this is so far a niche public, we should not fragment the activity too much and we should consider how understandable and relevant the benchmark activity is to likely semantic data web adopters.</p> ]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2007-08-28#1248">
  <rss:title>Virtuoso and cluster capacity allocation</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-08-28T11:54:30Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso and cluster capacity allocation I just read Google&#39;s Bigtable paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines. I have talked about partitioning versus distributed cache in the second to last post. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be. So Oracle has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition. I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the data and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too. So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course. Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. 
The tree top must be common knowledge, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time. So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found. The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server&#39;s file group gets too big, it splits. One half of each key&#39;s partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning information on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there. If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically. If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines. So what happens to disk locality when file groups split? 
Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc. Now let us suppose the first partition, the even K&#39;s, splits. It could split so that multiples of 4 go one way and the rest another way. Now we&#39;d have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic; remember that these are controlled by the same server process. For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change. Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference. For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive. 
Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs. Implementing this on the current Virtuoso is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal. Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso and cluster capacity allocation</div>
<p>I just read <a href="http://labs.google.com/papers/bigtable.html" id="link-id140a68b0">Google&#39;s Bigtable</a> paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines.</p>
<p>I have talked about partitioning versus distributed cache in the <a href="http://www.openlinksw.com/weblog/oerling/?id=1229" id="link-id13f70dc8">second to last post</a>. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be.</p>
<p>So <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0xa2957e38">Oracle</a> has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition.</p>
<p>I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the <a href="http://dbpedia.org/resource/Data" id="link-id0xa20ce8e0">data</a> and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too.</p>
<p>So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course.</p>
<p>Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. The tree top must be common <a href="http://dbpedia.org/resource/Knowledge" id="link-id0xa20cebb0">knowledge</a>, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time.</p>
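<p>As a rough illustration of the split-when-too-big idea, here is a minimal Python sketch of a range-partitioned map. It is not Bigtable code; the class name RangePartitionedTable, the max_tablet_rows threshold, and the use of row counts instead of megabytes are all assumptions made for the example.</p>

```python
import bisect


class RangePartitionedTable:
    """Toy range-partitioned map; each tablet owns keys from its start key up to the next start key."""

    def __init__(self, max_tablet_rows=4):
        self.max_tablet_rows = max_tablet_rows  # hypothetical split threshold, counted in rows
        self.starts = [""]    # sorted lower bounds; "" sorts before every other string key
        self.tablets = [{}]   # tablets[i] holds the rows whose keys start at starts[i]

    def _index_for(self, key):
        # Rightmost tablet whose start key is not greater than the key.
        return bisect.bisect_right(self.starts, key) - 1

    def put(self, key, value):
        i = self._index_for(key)
        self.tablets[i][key] = value
        if len(self.tablets[i]) > self.max_tablet_rows:
            self._split(i)

    def get(self, key):
        return self.tablets[self._index_for(key)].get(key)

    def _split(self, i):
        # Split at the median key, much like a B-tree leaf split.
        keys = sorted(self.tablets[i])
        mid = keys[len(keys) // 2]
        old = self.tablets[i]
        upper = {k: v for k, v in old.items() if k >= mid}
        lower = {k: v for k, v in old.items() if k not in upper}
        self.tablets[i] = lower
        self.starts.insert(i + 1, mid)
        self.tablets.insert(i + 1, upper)
```

<p>The starts list plays the role of the tree top: it is small, changes only on a split, and is therefore cheap to cache on every node.</p>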
<p>So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found.</p>
<p>The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server&#39;s file group gets too big, it splits. One half of each key&#39;s partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning <a href="http://dbpedia.org/resource/Information" id="link-id0xc803148">information</a> on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there.</p>
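<p>The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Virtuoso code: the FileGroup class, the 16 bit HASH_SPACE, and the key_hash function are hypothetical stand-ins.</p>

```python
from dataclasses import dataclass, field

HASH_SPACE = 1 << 16  # hypothetical size of the hash space


def key_hash(key):
    # Stand-in hash for integer keys; a real system would use its own function.
    return (key * 2654435761) % HASH_SPACE


@dataclass
class FileGroup:
    name: str
    host: str
    lo: int   # inclusive lower bound of the owned hash range
    hi: int   # exclusive upper bound of the owned hash range
    rows: dict = field(default_factory=dict)

    def owns(self, h):
        return h in range(self.lo, self.hi)

    def purge(self):
        # Background cleanup: drop rows that no longer hash into this group.
        self.rows = {k: v for k, v in self.rows.items() if self.owns(key_hash(k))}


def split(group, new_name):
    """Halve the hash range of a file group; the copy starts on the same host."""
    mid = (group.lo + group.hi) // 2
    # The copy begins as a full duplicate, stale rows included.
    copy = FileGroup(new_name, group.host, mid, group.hi, dict(group.rows))
    group.hi = mid
    return copy


def locate(groups, key):
    # The partition map every server keeps: hash range to file group.
    h = key_hash(key)
    return next(g for g in groups if g.owns(h))
```

<p>Right after split() both groups still hold every row, yet locate() already routes correctly because only the hash ranges changed; purge() can then run at leisure.</p>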
<p>If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically.  </p>
<p>If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines.</p>
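<p>The placement constraint alone can be sketched as follows; the synchronized split of replicas is not shown. The function name place_replicas and the rotation scheme are assumptions made for the illustration.</p>

```python
def place_replicas(file_groups, hosts, copies=2):
    """Assign the replicas of each file group to distinct hosts, rotating the starting host to spread load."""
    if copies > len(hosts):
        raise ValueError("need at least as many hosts as replicas")
    placement = {}
    for i, group in enumerate(file_groups):
        # Consecutive offsets modulo the host count never repeat while copies does not exceed len(hosts).
        placement[group] = [hosts[(i + r) % len(hosts)] for r in range(copies)]
    return placement
```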
<p>So what happens to disk locality when file groups split? Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc.  </p>
<p>Now let us suppose the first partition, the even K&#39;s, splits. It could split so that multiples of 4 go one way and the rest another way. Now we&#39;d have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic; remember that these are controlled by the same server process.</p>
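<p>The arithmetic of this example can be checked with a short Python sketch. It assumes that bits 10-20 means bits 10 through 20 inclusive, and the partition numbers 0, 1 and 2 are labels chosen for the illustration.</p>

```python
LOW_BIT = 10     # bits below this are ignored, so 1024 consecutive keys stay together
BIT_COUNT = 11   # bits 10 through 20 inclusive (an assumption about the intended range)


def chunk_of(key):
    """Return the value of bits 10-20 of an integer key."""
    return (key >> LOW_BIT) & ((1 << BIT_COUNT) - 1)


def partition_before_split(key):
    # Two-way partitioning: even chunks go to partition 0, odd chunks to partition 1.
    return chunk_of(key) % 2


def partition_after_split(key):
    # Partition 0 has split: chunks that are multiples of 4 stay in place,
    # the other even chunks move to the new partition 2. Odd chunks are untouched.
    chunk = chunk_of(key)
    if chunk % 2 == 1:
        return 1
    return 0 if chunk % 4 == 0 else 2
```

<p>Every run of 1024 consecutive keys still lands in a single partition after the split, which is why key compression and page locality survive.</p>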
<p>For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change.  </p>
<p>Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference.</p>
<p>For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive.</p>
<p>Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs.</p>
<p>Implementing this on the current <a href="http://virtuoso.openlinksw.com" id="link-id0xa21de0e0">Virtuoso</a> is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal.</p>
<p>Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2007-08-28#1246">
  <rss:title>Virtuoso and cluster capacity allocation</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-08-28T10:08:25Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I just read Google&#39;s Bigtable paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines. I have talked about partitioning versus distributed cache in the second to last post. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be. So Oracle has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition. I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the data and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too. So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course. Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. 
The tree top must be common knowledge, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time. So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found. The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server&#39;s file group gets too big, it splits. One half of each key&#39;s partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning information on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there. If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically. If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines. So what happens to disk locality when file groups split? 
Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc. Now let us suppose the first partition, the even K&#39;s, splits. It could split so that multiples of 4 go one way and the rest another way. Now we&#39;d have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic; remember that these are controlled by the same server process. For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change. Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference. For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive. 
Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs. Implementing this on the current Virtuoso is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal. Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I just read <a href="http://labs.google.com/papers/bigtable.html" id="link-id10967a78">Google&#39;s Bigtable</a> paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines.</p>
<p>I have talked about partitioning versus distributed cache in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1229" id="link-id10913318">second to last post</a>. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be.</p>
<p>So <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1c4caaa0">Oracle</a> has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition.</p>
<p>I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the <a href="http://dbpedia.org/resource/Data" id="link-id0xa1b52ab8">data</a> and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too.</p>
<p>So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course.</p>
<p>Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. The tree top must be common <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x9f94a5f8">knowledge</a>, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time.</p>
<p>So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found.</p>
<p>The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server&#39;s file group gets too big, it splits. One half of each key&#39;s partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning <a href="http://dbpedia.org/resource/Information" id="link-id0x1a3e17a0">information</a> on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there.</p>
<p>If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically.  </p>
<p>If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines.</p>
<p>So what happens to disk locality when file groups split? Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc.  </p>
<p>Now let us suppose the first partition, the even K&#39;s, splits. It could split so that multiples of 4 go one way and the rest another way. Now we&#39;d have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic; remember that these are controlled by the same server process.</p>
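<p>The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Virtuoso code: the names <code>hash_bits</code> and <code>PartitionMap</code> are made up, &quot;bits 10-20&quot; is read as the ten bits starting at bit 10, and a split is modeled by doubling the modulus of the rule that selects the file group.</p>

```python
# Sketch only: names and the modulus-based split rule are illustrative assumptions.
LOW_BIT = 10    # first bit of the key used for partitioning
N_BITS = 10     # assumed width; "bits 10-20" is read here as bits 10..19


def hash_bits(key):
    """Return the partitioning bits of key; 1024 consecutive keys share a value."""
    return (key // 2 ** LOW_BIT) % 2 ** N_BITS


class PartitionMap:
    """Maps hash values to file groups with rules of the form h % modulus == remainder."""

    def __init__(self, n_groups):
        # One rule per initial file group: (modulus, remainder) maps to a group id.
        self.rules = {(n_groups, r): r for r in range(n_groups)}
        self.next_id = n_groups

    def locate(self, key):
        """Return the id of the file group that owns key."""
        h = hash_bits(key)
        for (modulus, remainder), group in self.rules.items():
            if h % modulus == remainder:
                return group
        raise KeyError(key)

    def split(self, group):
        """Split a file group in two; half of its hash values move to a new group."""
        new_group = self.next_id
        self.next_id += 1
        # Iterate over a snapshot because the rule table is modified in the loop.
        for (modulus, remainder), owner in list(self.rules.items()):
            if owner == group:
                del self.rules[(modulus, remainder)]
                self.rules[(2 * modulus, remainder)] = group
                self.rules[(2 * modulus, remainder + modulus)] = new_group
        return new_group


pmap = PartitionMap(2)
new_group = pmap.split(0)          # the "even K's" partition splits
for key in (0, 1024, 2048, 3072, 4096):
    print(key, pmap.locate(key))
```

<p>After the split, keys 0-1K and 4K-5K stay in file group 0, keys 2K-3K land in the new file group, and the odd K&#39;s are untouched. Only the rule table changes, which is the small piece of partitioning information every server would need to refresh.</p>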
<p>For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change.  </p>
<p>Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference.</p>
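<p>A rough sketch of the batching idea, again in Python and again hypothetical: the request tuples, the host names and the <code>group_to_host</code> map are invented for illustration. Requests are addressed to file groups, and only at send time are they grouped by whichever host currently manages each file group.</p>

```python
from collections import defaultdict


def batch_by_host(requests, group_to_host):
    """Group (file_group, payload) requests into one batch per host.

    Returns {host: {file_group: [payloads]}} so that the receiving server
    can run each file group's work sequentially or on its own thread.
    """
    batches = defaultdict(lambda: defaultdict(list))
    for file_group, payload in requests:
        host = group_to_host[file_group]   # allocation may change over time
        batches[host][file_group].append(payload)
    return {host: dict(groups) for host, groups in batches.items()}


# Hypothetical allocation: file groups 0 and 2 are collocated on host-a.
group_to_host = {0: "host-a", 1: "host-b", 2: "host-a"}
requests = [(0, "lookup k1"), (1, "lookup k2"), (2, "lookup k3"), (0, "lookup k4")]
print(batch_by_host(requests, group_to_host))
```

<p>If a file group moves, only <code>group_to_host</code> changes; the requests themselves keep naming the file group as their recipient.</p>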
<p>For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive.</p>
<p>Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs.</p>
<p>Implementing this on the current <a href="http://virtuoso.openlinksw.com" id="link-id0x1a35c638">Virtuoso</a> is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal.</p>
<p>Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2007-02-09#1137">
  <rss:title>Hello Data Web (Take 2 - with Screenshots)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-02-09T01:46:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">While I continue to wrestle with screencast production etc., here are some screenshots that guide you through the process of providing Data Web URIs to the SPARQL Query Builder (first cut of an MS Query or MS ACCESS type tool for the Data Web). Step 1 - Enter a Data Source URI Step 2 - Click on the Run Control (&quot;&gt;&quot; video control icon) Step 3 - Interact with Custom Grid hosted results (comprised of Resource Identifiers (S), Properties (P), and Property Values (O)). Once you grasp the concept of entering values into the &quot;Default Data Source URI field&quot;, take a look at: http://programmableweb.com and other URIs (hint: scroll through the results grid to the QEDWiki demo item)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>While I continue to wrestle with screencast production etc., here are some screenshots that guide you through the process of providing Data Web URIs to the SPARQL Query Builder (first cut of an MS Query or MS ACCESS type tool for the Data Web).</p>
<ol>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/briefcase/Public/Screenshots/sparql_qbe1.png">Step 1 - Enter a Data Source URI</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/briefcase/Public/Screenshots/sparql_qbe2.png">Step 2 - Click on the Run Control (&quot;&gt;&quot; video control icon)</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/briefcase/Public/Screenshots/sparql_qbe3.png">Step 3 - Interact with Custom Grid hosted results (comprised of Resource Identifiers (S), Properties (P), and Property Values (O)).</a>
</li>
</ol>
<p>Once you grasp the concept of entering values into the &quot;Default Data Source URI field&quot;, take a look at: http://programmableweb.com and other URIs (hint: scroll through the results grid to the QEDWiki demo item)</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2007-02-08#1134">
  <rss:title>Hello Data Web!</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-02-08T19:13:48Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This simple demo uses our Ajax based Visual Query Builder for the SPARQL Query Language (this isn&#39;t Grandma&#39;s Data Web UI, but not to worry, that is on its way also). Here goes: go to http://demo.openlinksw.com/isparql Enter any of the following values into the &quot;Default Data URI&quot; field: - http://www.mkbergman.com/?p=336 - http://radar.oreilly.com/archives/2007/02/pipes_and_filte.html - http://jeremy.zawodny.com/blog/archives/008513.html - Other URIs What I am demonstrating is how existing Web Content hooks transparently into the &quot;Data Web&quot;. Zero RDF Tax :-) Everything is good! Note: Please look to the bottom of the screen for the &quot;Run Query&quot; Button. Remember, it is not quite Grandma&#39;s UI but should do for Infonauts etc. A screencast will follow.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This simple demo uses our Ajax based Visual Query Builder for the SPARQL Query Language (this isn&#39;t Grandma&#39;s Data Web UI, but not to worry, that is on its way also). Here goes:</p>

<ol>
<li>
go to http://demo.openlinksw.com/isparql
</li>
<li>
Enter any of the following values into the &quot;Default Data URI&quot; field:
<ul>
<li>http://www.mkbergman.com/?p=336</li>
<li>http://radar.oreilly.com/archives/2007/02/pipes_and_filte.html</li>
<li>http://jeremy.zawodny.com/blog/archives/008513.html</li>
<li>Other URIs</li>
</ul>
</li>
</ol>
<p>
What I am demonstrating is how existing Web Content hooks transparently into the <a href="http://www.openlinksw.com/weblog/public/search.vspx?blogid=127&q=data%20web&type=text&output=html">&quot;Data Web&quot;</a>. Zero RDF Tax :-) Everything is good!</p>

<p>Note: Please look to the bottom of the screen for the &quot;Run Query&quot; Button. Remember, it is not quite Grandma&#39;s UI but should do for Infonauts etc. A screencast will follow.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2007-01-18#1122">
  <rss:title>Semantic Web &amp; Data Integration</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2007-01-18T00:36:25Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Stefano Mazzocchi, via his blog: Stefano&#39;s Linotype, delivers an insightful contribution to the ongoing effort to recapture the essence of the original Semantic Web vision. The Semantic Web is about granular exposure of the underlying web-of-data that fuels the World Wide Web. It models &quot;Web Data&quot; using a Directed Graph Data Model (back-to-the-future: Network Model Database) called RDF. In line with contemporary database technology thinking, the Semantic Web also seeks to expose Web Data to architects, developers, and users via a concrete Conceptual Layer that is defined using RDF Schema. The abstract nature of Conceptual Models implies that actual instance data (Entities, Attributes, and Relationships/Associations) occurs by way of &quot;Logical to Conceptual&quot; schema mapping and data generation that can involve a myriad of logical data sources (SQL, XML, Object databases, traditional web content, RSS/Atom feeds etc.). Thus, by implication, it is safe to assume that the Semantic Web&#39;s construction is basically a Data Integration and exposure effort. This is the point that Stefano alludes to in the blog post excerpts that follow: The semantic web is really just data integration at a global scale. Some of this data might end up being consistent, detailed and small enough to perform symbolic reasoning on, but even if this is the case, that would be such a small, expensive and fragile island of knowledge that it would have the same impact on the world as calculus had on deciding to invade Iraq. The biggest problem we face right now is a way to &#39;link&#39; information that comes from different sources that can scale to hundreds of millions of statements (and hundreds of thousands of equivalences). Equivalences and subclasses are the only things that we have ever needed of OWL and RDFS, we want to &#39;connect&#39; dots that otherwise would be unconnected. 
We want to suggest people to use whatever ontology pleases them and then think of just mapping it against existing ones later. This is easier to bootstrap than to force them to agree on a conceptualization before they even know how to start! Additional insightful material from Stefano: A No-Nonsense Guide to Semantic Web Specs for XML People [Part I] A No-nonsense Guide to Semantic Web Specs for XML People [Part II] Benjamin Nowack also chimes into this conversation via his simple guide to understanding Data, Information, and Knowledge in relation to the Semantic Web.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.betaversion.org/~stefano/">Stefano Mazzocchi</a>, via his blog: <a href="http://www.betaversion.org/~stefano/linotype/">Stefano&#39;s Linotype</a>, delivers an <a href="http://www.betaversion.org/~stefano/linotype/news/99/">insightful contribution</a> to the ongoing effort to recapture the essence of the original <a href="http://en.wikipedia.org/wiki/Semantic_Web">Semantic Web </a>vision.</p>

<p>The Semantic Web is about granular exposure of the underlying web-of-data that fuels the World Wide Web. It models &quot;<a href="http://www.w3.org/1999/04/WebData">Web Data</a>&quot; using a <a href="http://en.wikipedia.org/wiki/Graph_(mathematics)">Directed Graph</a> Data Model (back-to-the-future: <a href="http://en.wikipedia.org/wiki/Network_model">Network Model Database</a>) called <a href="http://www.w3.org/TR/rdf-primer/">RDF</a>.</p>
<p>In line with contemporary database technology thinking, the Semantic Web also seeks to expose Web Data to architects, developers, and users via a concrete <a href="http://en.wikipedia.org/wiki/Conceptual_schema">Conceptual Layer</a> that is defined using <a href="http://www.w3.org/TR/rdf-schema/">RDF Schema</a>.</p>
<p>The abstract nature of Conceptual Models implies that actual instance data (<a href="http://en.wikipedia.org/wiki/Entity-relationship_diagrams">Entities, Attributes, and Relationships/Associations</a>) occurs by way of &quot;Logical to Conceptual&quot; schema mapping and data generation that can involve a myriad of logical data sources (SQL, XML, Object databases, traditional web content, <a href="http://en.wikipedia.org/wiki/Rss_%28file_format%29">RSS</a>/<a href="http://en.wikipedia.org/wiki/Atom_%28standard%29">Atom</a> feeds etc.). Thus, by implication, it is safe to assume that the Semantic Web&#39;s construction is basically a <a href="http://en.wikipedia.org/wiki/Data_integration">Data Integration</a> and exposure effort. This is the point that Stefano alludes to in the blog post excerpts that follow: </p>
<blockquote>
<p>The semantic web is really just data integration at a global scale. Some of this data might end up being consistent, detailed and small enough to perform symbolic reasoning on, but even if this is the case, that would be such a small, expensive and fragile island of knowledge that it would have the same impact on the world as calculus had on deciding to invade Iraq.</p>

<p>The biggest problem we face right now is a way to &#39;link&#39; information that comes from different sources that can scale to hundreds of millions of statements (and hundreds of thousands of equivalences). Equivalences and subclasses are the only things that we have ever needed of <a href="http://www.w3.org/TR/owl-features/">OWL</a> and RDFS, we want to &#39;connect&#39; dots that otherwise would be unconnected. We want to suggest people to use whatever ontology pleases them and then think of just mapping it against existing ones later. This is easier to bootstrap than to force them to agree on a conceptualization before they even know how to start!</p>
</blockquote>

<p>Additional insightful material from Stefano:</p>
<ol>
<li>
<a href="http://www.betaversion.org/~stefano/linotype/news/57/">A No-Nonsense Guide to Semantic Web Specs for XML People [Part I]</a>
</li>
<li>
<a href="http://www.betaversion.org/~stefano/linotype/news/78/">A No-nonsense Guide to Semantic Web Specs for XML People [Part II]</a>
</li>
</ol>
<p>
<a href="http://bnode.org/blog/sw_en">Benjamin Nowack</a> also chimes into this conversation via his <a href="http://rdfer.com/swk/data-information-knowledge">simple guide to understanding Data, Information, and Knowledge</a> in relation to the Semantic Web.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-12-07#1095">
  <rss:title>SPARQL, Ajax, Tagging, Folksonomies, Share Ontologies and Semantic Web</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-12-07T17:35:29Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A quick dump that demonstrates how I integrate tags and links from del.icio.us with links from my local bookmark database via one of my public Data Spaces (this demo uses the kidehen Data Space). SPARQL (query language for the Semantic Web) basically enables me to query a collection of typed links (predicates/properties/attributes) in my Data Space (ODS based of course) without breaking my existing local bookmarks database or the one I maintain at del.icio.us. I am also demonstrating how Web 2.0 concepts such as Tagging mesh nicely with the more formal concepts of Topics in the Semantic Web realm. The key to all of this is the ability to generate RDF Data Model Instance Data based on Shared Ontologies such as SIOC (from DERI&#39;s SIOC Project) and SKOS (again showing that Ontologies and Folksonomies are complementary). This demo also shows that Ajax works well in the Semantic Web realm (or web dimension of interaction 3.0) especially when you have a toolkit with Data Aware controls (for SQL, RDF, and XML) such as OAT (OpenLink Ajax Toolkit). For instance, we&#39;ve successfully used this to build a Visual Query Building Tool for SPARQL (alpha) that really takes a lot of the pain out of constructing SPARQL Queries (there is much more to come on this front re. handling of DISTINCT, FILTER, ORDER BY etc.). For now, take a look at the SPARQL Query dump generated by this SIOC &amp; SKOS SPARQL QBE Canvas Screenshot. You can cut and paste the queries that follow into the Query Builder or use the screenshot to build your variation of this query sample. Alternatively, you can simply click on *This* SPARQL Protocol URL to see the query results in a basic HTML Table. And one last thing, you can grab the SPARQL Query File saved into my ODS-Briefcase (the WebDAV repository aspect of my Data Space). 
Note the following SPARQL Protocol Endpoints: MyOpenLink Data Space Experimental Data Space SPARQL Query Builder (you need to register at http://myopenlink.net:8890/ods to use this version) Live Demo Server Demo Server SPARQL Query Builder (use: demo for both username and pwd when prompted) My beautified Version of the SPARQL Generated by QBE (you can cut and paste into &quot;Advanced Query&quot; section of QBE) is presented below: PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; PREFIX sioc: &lt;http://rdfs.org/sioc/ns#&gt; PREFIX dct: &lt;http://purl.org/dc/elements/1.1/&gt; PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; SELECT distinct ?forum_name, ?owner, ?post, ?title, ?link, ?url, ?tag FROM &lt;http://myopenlink.net/dataspace&gt; WHERE { ?forum a sioc:Forum; sioc:type &quot;bookmark&quot;; sioc:id ?forum_name; sioc:has_member ?owner. ?owner sioc:id &quot;kidehen&quot;. ?forum sioc:container_of ?post . ?post dct:title ?title . optional { ?post sioc:link ?link } optional { ?post sioc:links_to ?url } optional { ?post sioc:topic ?topic. ?topic a skos:Concept; skos:prefLabel ?tag}. } Unmodified dump from the QBE (this will be beautified automatically in due course by the QBE): PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; PREFIX sioc: &lt;http://rdfs.org/sioc/ns#&gt; PREFIX dct: &lt;http://purl.org/dc/elements/1.1/&gt; PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; SELECT ?var8 ?var9 ?var13 ?var14 ?var24 ?var27 ?var29 ?var54 ?var56 WHERE { graph ?graph { ?var8 rdf:type sioc:Forum . ?var8 sioc:container_of ?var9 . ?var8 sioc:type &quot;bookmark&quot; . ?var8 sioc:id ?var54 . ?var8 sioc:has_member ?var56 . ?var9 rdf:type sioc:Post . OPTIONAL {?var9 dc:title ?var13} . OPTIONAL {?var9 sioc:links_to ?var14} . OPTIONAL {?var9 sioc:link ?var29} . ?var9 sioc:has_creator ?var37 . OPTIONAL {?var9 sioc:topic ?var24} . ?var24 rdf:type skos:Concept . OPTIONAL {?var24 skos:prefLabel ?var27} . ?var56 rdf:type sioc:User . 
?var56 sioc:id &quot;kidehen&quot; . } } Current missing items re. Visual QBE for SPARQL are: Ability to Save properly to WebDAV so that I can then expose various saved SPARQL Queries (.rq file) from my Data Space via URIs Handling of DISTINCT, FILTERS (note: OPTIONAL is handled via dotted predicate-links) General tidying up re. click event handling etc. Note: You can even open up your own account (using our Live Demo or Live Experiment Data Space servers) which enables you to repeat this demo by doing the following (post registration/sign-up): Export some bookmarks from your local browser to the usual HTML bookmarks dump file Create an ODS-Bookmarks Instance using your new ODS account Use the ODS-Bookmark Instance to import your local bookmarks from the HTML dump file Repeat the same import sequence using the ODS-Bookmark Instance, but this time pick the del.icio.us option Build your query (change &#39;kidehen&#39; to your ODS-user-name) That&#39;s it: you now have Semantic Web presence in the form of a Data Space for your local and del.icio.us hosted bookmarks with tags integrated Quick Query Builder Tip: You will need to import the following (using the Import Button in the Ontologies &amp; Schemas side-bar); http://www.w3.org/1999/02/22-rdf-syntax-ns# (RDF) http://rdfs.org/sioc/ns# (SIOC) http://purl.org/dc/elements/1.1/ (Dublin Core) http://www.w3.org/2004/02/skos/core# (SKOS) Browser Support: The SPARQL QBE is SVG based and currently works fine with the following browsers; Firefox 1.5/2.0, Camino (Cocoa variant of Firefox for Mac OS X), Webkit (Safari pre-release / advanced sibling), Opera 9.x. We are evaluating the use of the Adobe SVG plugin re. IE 6/7 support. Of course this should be a screencast, but I am in the middle of a plethora of things right now :-)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>A quick dump that demonstrates how I integrate tags and links from del.icio.us with links from my local bookmark database via one of my public Data Spaces (this demo uses the <a href="http://myopenlink.net:8890/dataspace/kidehen">kidehen Data Space</a>).</p>

<p>
<a href="http://www.w3.org/TR/rdf-sparql-query/">SPARQL</a> (query language for the Semantic Web) basically enables me to query a collection of typed links (predicates/properties/attributes) in my Data Space (<a href="http://virtuoso.openlinksw.com/wiki/main/Main/OdsIndex">ODS</a> based of course) without breaking my existing local bookmarks database or the one I maintain at del.icio.us.</p>

<p>I am also demonstrating how <a href="http://en.wikipedia.org/wiki/Web_2.0">Web 2.0</a> concepts such as <a href="http://en.wikipedia.org/wiki/Tags">Tagging</a> mesh nicely with the more formal concepts of Topics in the Semantic Web realm. The key to all of this is the ability to generate <a href="http://www.w3.org/TR/rdf-primer/">RDF Data Model</a> Instance Data based on <a href="http://en.wikipedia.org/wiki/Upper_ontology_(computer_science)">Shared Ontologies</a> such as <a href="http://rdfs.org/sioc/spec/">SIOC</a> (from <a href="http://www.semanticweb.org/">DERI</a>&#39;s <a href="http://sioc-project.org/">SIOC Project</a>) and <a href="http://www.w3.org/2004/02/skos/">SKOS</a> (again showing that <a href="http://tomgruber.org/writing/ontology-of-folksonomy.htm">Ontologies and Folksonomies</a> are complementary).</p>

<p>This demo also shows that Ajax works well in the Semantic Web realm (or <a href="http://www.openlinksw.com/blog/~kidehen/?id=1037">web dimension of interaction 3.0</a>), especially when you have a toolkit with Data Aware controls (for SQL, RDF, and XML) such as OAT (<a href="http://demo.openlinksw.com/DAV/JS/demo/index.html">OpenLink Ajax Toolkit</a>). For instance, we&#39;ve successfully used this to build a <a href="http://myopenlink.net:8890/isparl/">Visual Query Building Tool for SPARQL</a> (alpha) that really takes a lot of the pain out of constructing SPARQL Queries (there is much more to come on this front re. handling of DISTINCT, FILTER, ORDER BY, etc.). </p>

<p>For now, take a look at the SPARQL Query dump generated by this <a href="http://myopenlink.net:8890/DAV/home/kidehen/gallery/my_photos/sparql_qbe_sioc_skos_shot1.png">SIOC &amp; SKOS SPARQL QBE Canvas Screenshot</a>. </p>

<p>You can cut and paste the queries that follow into the Query Builder or use the screenshot to build your variation of this query sample. Alternatively, you can simply click on *<a href="http://myopenlink.net:8890/sparql?default-graph-uri=http%3A%2F%2Fmyopenlink.net%2Fdataspace&query=PREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0APREFIX+sioc%3A+++%3Chttp%3A%2F%2Frdfs.org%2Fsioc%2Fns%23%3E%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%0D%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0D%0A%0D%0ASELECT+distinct+%3Fforum_name%2C+%3Fowner%2C+%3Fpost%2C+%3Ftitle%2C+%3Flink%2C+%3Furl+%3Ftag%0D%0AFROM+%3Chttp%3A%2F%2Fmyopenlink.net%2Fdataspace%3E%0D%0AWHERE+%7B%0D%0A++++++++%3Fforum+a+sioc%3AForum.%0D%0A++++++++%3Fforum+sioc%3Atype+%22bookmark%22.%0D%0A++++++++%3Fforum+sioc%3Aid+%3Fforum_name.%0D%0A++++++++%3Fforum+sioc%3Ahas_member+%3Fowner.%0D%0A++++++++%3Fowner+sioc%3Aid+%22kidehen%22.%0D%0A++++++++%3Fforum+sioc%3Acontainer_of+%3Fpost+.%0D%0A++++++++%3Fpost++dct%3Atitle+%3Ftitle+.%0D%0A++++++++optional+%7B+%3Fpost+sioc%3Atopic+%3Ftopic.%0D%0A+++++++++++++++++++%3Ftopic+a+skos%3AConcept%3B%0D%0A+++++++++++++++++++++++++skos%3AprefLabel+%3Ftag.+%7D%0D%0A++++++++optional%7B+%3Fpost+sioc%3Alink+%3Flink++%7D+.%0D%0A++++++++optional%7B+%3Fpost+sioc%3Alinks_to+%3Furl+%7D%0D%0A++++++%7D%0D%0AORDER+BY+%3Ftitle&format=text%2Fhtml">This</a>* <a href="http://www.w3.org/TR/rdf-sparql-protocol/">SPARQL Protocol</a> URL to see the query results in a basic HTML Table. And one last thing, you can grab the <a href="http://myopenlink.net:8890/DAV/home/kidehen/SPARQL/tagging_sioc_skos_delicios_my_bookmarks.rq">SPARQL Query File</a> saved into my <a href="http://virtuoso.openlinksw.com/wiki/main/Main/OdsBriefcase">ODS-Briefcase</a> (the <a href="http://en.wikipedia.org/wiki/WebDAV">WebDAV</a> repository aspect of my Data Space).
</p>
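<p>For anyone who would rather build such a SPARQL Protocol URL programmatically than by hand, here is a minimal Python sketch. It assumes the endpoint and graph IRI named in this post (which may no longer be reachable), and the function name <code>sparql_protocol_url</code> is made up for illustration. The <code>format</code> parameter is the one used in the link above and is a Virtuoso convenience rather than part of the SPARQL Protocol itself.</p>

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and graph IRI as named in this post; treat both as assumptions.
ENDPOINT = "http://myopenlink.net:8890/sparql"
DEFAULT_GRAPH = "http://myopenlink.net/dataspace"


def sparql_protocol_url(query, endpoint=ENDPOINT, graph=DEFAULT_GRAPH, fmt="text/html"):
    """Return a GET URL that carries the query as SPARQL Protocol parameters."""
    params = {
        "default-graph-uri": graph,  # standard SPARQL Protocol parameter
        "query": query,              # standard SPARQL Protocol parameter
        "format": fmt,               # endpoint-specific, as used in the link above
    }
    # urlencode percent-encodes every value and joins the pairs with "&".
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    q = "SELECT ?s WHERE { ?s a <http://rdfs.org/sioc/ns#Forum> } LIMIT 5"
    print(sparql_protocol_url(q))
```

<p>Pasting the printed URL into a browser should, under the assumptions above, return the result set as an HTML table.</p>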

<p>
<b>Note the following SPARQL Protocol Endpoints:</b>
</p>
<ol>
<li>
  <a href="http://myopenlink.net:8890/sparql/">MyOpenLink Data Space</a>
</li>
<li>
  <a href="http://myopenlink.net:8890/isparql/">Experimental Data Space SPARQL Query Builder</a> (you need to register at http://myopenlink.net:8890/ods to use this version)</li>
 <li>
  <a href="http://demo.openlinksw.com/sparql/">Live Demo Server</a>
 </li>
<li>
  <a href="http://demo.openlinksw.com/isparql/">Demo Server SPARQL Query Builder</a> (use: demo for both username and pwd when prompted)</li>
</ol>

<p>My beautified Version of the SPARQL Generated by QBE (you can cut and paste into &quot;Advanced Query&quot; section of QBE) is presented below:</p>
<pre>
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX sioc: &lt;http://rdfs.org/sioc/ns#&gt;
PREFIX dct: &lt;http://purl.org/dc/elements/1.1/&gt;
PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
<br />
SELECT distinct 
       ?forum_name, 
       ?owner, 
       ?post, 
       ?title, 
       ?link, 
       ?url, 
       ?tag
FROM &lt;http://myopenlink.net/dataspace&gt;
WHERE {
       ?forum a sioc:Forum;
                   sioc:type &quot;bookmark&quot;;
                   sioc:id ?forum_name;
                   sioc:has_member ?owner.
       ?owner sioc:id &quot;kidehen&quot;.
       ?forum sioc:container_of ?post .
       ?post  dct:title ?title .
       optional { ?post sioc:link ?link  }
       optional { ?post sioc:links_to ?url }
       optional { ?post sioc:topic ?topic.
                        ?topic a skos:Concept;
                                  skos:prefLabel ?tag}.
     } 
</pre>
<p>Unmodified dump from the QBE (this will be beautified automatically in due course by the QBE):</p>

<pre>
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX sioc: &lt;http://rdfs.org/sioc/ns#&gt;
PREFIX dct: &lt;http://purl.org/dc/elements/1.1/&gt;
PREFIX skos: &lt;http://www.w3.org/2004/02/skos/core#&gt;
<br />
SELECT ?var8 ?var9 ?var13 ?var14 ?var24 ?var27 ?var29 ?var54 ?var56
WHERE
{
graph ?graph {
 ?var8 rdf:type sioc:Forum .
 ?var8 sioc:container_of ?var9 .
 ?var8 sioc:type &quot;bookmark&quot; .
 ?var8 sioc:id ?var54 .
 ?var8 sioc:has_member ?var56 .
 ?var9 rdf:type sioc:Post .
 OPTIONAL {?var9 dc:title ?var13} .
 OPTIONAL {?var9 sioc:links_to ?var14} .
 OPTIONAL {?var9 sioc:link ?var29} .
 ?var9 sioc:has_creator ?var37 .
 OPTIONAL {?var9 sioc:topic ?var24} .
 ?var24 rdf:type skos:Concept .
 OPTIONAL {?var24 skos:prefLabel ?var27} .
 ?var56 rdf:type sioc:User .
 ?var56 sioc:id &quot;kidehen&quot; .
 }
} 
</pre>
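<p>One thing worth checking before pasting a generated query anywhere is that every prefix it uses is actually declared. The dump above declares <code>dct:</code> but uses <code>dc:title</code>, which a strict SPARQL parser would reject. Below is a rough Python sketch of such a check. It is regex-based, not a real SPARQL parser, and the function name <code>undeclared_prefixes</code> is hypothetical.</p>

```python
import re


def undeclared_prefixes(query):
    """Return the set of prefixes used in prefixed names but never declared."""
    declared = set(
        re.findall(r"PREFIX\s+([A-Za-z][\w-]*)\s*:", query, flags=re.IGNORECASE)
    )
    # Drop IRIs and string literals so "http:" or quoted text is not
    # mistaken for a prefixed name.
    body = re.sub(r"<[^<>\s]*>", " ", query)
    body = re.sub(r'"[^"\n]*"', " ", body)
    # A prefixed name is prefix, colon, then a local part that starts with a
    # letter or underscore; variables (?x, $x) are excluded by the lookbehind.
    used = set(re.findall(r"(?<![\w?$])([A-Za-z][\w-]*):[A-Za-z_]", body))
    return used - declared


if __name__ == "__main__":
    sample = (
        "PREFIX dct: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?t WHERE { ?post dc:title ?t }"
    )
    print(undeclared_prefixes(sample))
```

<p>Run against the unmodified dump (with the HTML entities decoded), this should report <code>dc</code> as the only undeclared prefix.</p>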

<p>
Current missing items re. Visual QBE for SPARQL are:</p>
<ol>
<li>
Ability to Save properly to WebDAV so that I can then expose various saved SPARQL Queries (.rq file) from my Data Space via URIs
</li>
<li>
Handling of DISTINCT, FILTERS (note: OPTIONAL is handled via dotted predicate-links)
</li>
<li>General tidying up re. click event handling etc.
</li>
</ol>

Note:
You can even open up your own account (using our <a href="http://demo.openlinksw.com/ods">Live Demo</a> or <a href="http://myopenlink.net:8890/ods">Live Experiment Data</a> Space servers) which enables you to repeat this demo by doing the following (post registration/sign-up):

<ol>
<li>Export some bookmarks from your local browser to the usual HTML bookmarks dump file</li>
<li>Create an ODS-Bookmarks Instance using your new ODS account</li>
<li>Use the ODS-Bookmark Instance to import your local bookmarks from the HTML dump file</li>
<li>Repeat the same import sequence using the ODS-Bookmark Instance, but this time pick the del.icio.us option</li>
<li>Build your query (change &#39;kidehen&#39; to your ODS-user-name)</li>
<li>That&#39;s it: you now have a Semantic Web presence in the form of a Data Space for your local and del.icio.us hosted bookmarks with tags integrated</li>
</ol>
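<p>The "change 'kidehen' to your ODS-user-name" step can also be scripted. The sketch below is only an illustration: the query text is the beautified query from this post (comma-separated SELECT list included, which Virtuoso accepts), while the <code>@USER@</code> placeholder and the function name <code>bookmark_query</code> are my own inventions.</p>

```python
QUERY_TEMPLATE = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dct: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT distinct ?forum_name, ?owner, ?post, ?title, ?link, ?url, ?tag
FROM <http://myopenlink.net/dataspace>
WHERE {
       ?forum a sioc:Forum;
              sioc:type "bookmark";
              sioc:id ?forum_name;
              sioc:has_member ?owner.
       ?owner sioc:id "@USER@".
       ?forum sioc:container_of ?post .
       ?post  dct:title ?title .
       optional { ?post sioc:link ?link  }
       optional { ?post sioc:links_to ?url }
       optional { ?post sioc:topic ?topic.
                  ?topic a skos:Concept;
                         skos:prefLabel ?tag}.
     }
"""


def bookmark_query(ods_user):
    """Return the bookmark query with the given ODS user name filled in."""
    # Escape backslashes and double quotes so the name cannot break out of
    # the SPARQL string literal.
    literal = ods_user.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.replace("@USER@", literal)


if __name__ == "__main__":
    print(bookmark_query("kidehen"))
```

<p>The result can be pasted into the &quot;Advanced Query&quot; section of the QBE or sent to a SPARQL endpoint. If you are using your own server, adjust the FROM graph as well.</p>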

<p>Quick Query Builder Tip:
You will need to import the following (using the Import Button in the Ontologies &amp; Schemas side-bar):</p>
<ol>
<li>
  <a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#">http://www.w3.org/1999/02/22-rdf-syntax-ns#</a> (<a href="http://www.w3.org/TR/rdf-primer/">RDF</a>)</li>
<li>
  <a href="http://rdfs.org/sioc/ns#">http://rdfs.org/sioc/ns#</a> (<a href="http://rdfs.org/sioc/spec/">SIOC</a>)</li>
<li>
  <a href="http://purl.org/dc/elements/1.1/">http://purl.org/dc/elements/1.1/</a> (<a href="http://dublincore.org/">Dublin Core</a>)</li>
<li>
  <a href="http://www.w3.org/2004/02/skos/core#">http://www.w3.org/2004/02/skos/core#</a> (<a href="http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20050510/">SKOS</a>)</li>
</ol>

<p>Browser Support: The SPARQL QBE is SVG based and currently works fine with the following browsers: Firefox 1.5/2.0, Camino (Cocoa variant of Firefox for Mac OS X), Webkit (Safari pre-release / advanced sibling), Opera 9.x. We are evaluating the use of the Adobe SVG plugin re. IE 6/7 support.</p>

<p>Of course this should be a screencast, but I am in the middle of a plethora of things right now :-)
</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-07-15#1006">
  <rss:title>GeoRSS &amp; Geonames for Philanthropy re. Kiva Microfinance</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-07-15T14:11:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Via Geospatial Semantic Web Blog.) GeoRSS &amp; Geonames for Philanthropy: &quot; I heard about Kiva.ORG in a BusinessWeek podcast. After visiting its website, I think there are few places where GeoRSS (in the RDF/A syntax) and Geonames can be used to enhance the site’s functionality. Kiva.ORG Background It’s a microfinance website for people in the developing countries. Its business model is in the intersection between peer-to-peer financing and philanthropy. The goal is to help developing country businesses to borrow small loans from a large group of Web users, so that they can avoid paying high interests to the banks. For example, a person in Uganda can request a $500 loan and use it for buying and selling more poultry. One or more lenders (anyone on the Web) may decide to grant loans to that person in increments as tiny as $25. After few years, that person will pay back the loans to the lenders. How GeoRSS and Geonames Can Help I went to the website and discovered the site has a relative weak search and browsing interface. In particular, there is no way to group loan requests based on geographical locations (e.g., countries, cities and regions). Took a look at individual loan pages. Each page actually has standard ways to describe location information – e.g., Location: Mbale, Uganda. It should be relative easy to add GeoRSS points (in the RDF/A syntax) to describe these location information (an alternative maybe using Microformat Geo or W3C Geo). Once the location information is annotated, one can imagine building a map mashup to display loan requests in a geospatial perspective. One can also build search engines to support spatial queries such as “find me all loans with from Mbale”. Since Kiva.ORG webmasters may not be GIS experts, it will be nice if we can find ways to automatically geocode location information and describe that using GeoRSS.
This automatic geocoding procedure can be developed using Geonames’s webservices. Take a string “Mbale” or “Uganda”, and send to Geonames’s search service. The procedure will get back JSON or XML description of the location, which include latitude and longitude. This will then be used to annotate the location information in a Kiva loan page. Can you think of other ways to help Kiva.ORG to become more “geospatially intelligent”? You can learn more about Kiva.ORG at its website and listen to this podcast. &quot;</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(Via <a href="http://www.geospatialsemanticweb.com">Geospatial Semantic Web Blog</a>.)</p>
<p>
<a href="http://www.geospatialsemanticweb.com/2006/07/14/georss-geonames-for-philanthropy#comments">GeoRSS &amp; Geonames for Philanthropy</a>: &quot;</p>
<p>I heard about <a title="kiva.org" href="http://www.kiva.org">Kiva.ORG</a> in a BusinessWeek podcast. After visiting its website, I think there are few places where GeoRSS (in the RDF/A syntax) and Geonames can be used to enhance the site’s functionality.</p>
<h5>Kiva.ORG Background</h5>
<h5>
<img align="left" title="kiva.org" id="image92" alt="kiva.org" src="http://www.geospatialsemanticweb.com/wp-content/uploads/2006/07/kiva-bannersmall.png" />
</h5>
<p>It’s a microfinance website for people in the developing countries. Its business model is in the intersection between peer-to-peer financing and philanthropy. The goal is to help developing country businesses to borrow small loans from a large group of Web users, so that they can avoid paying high interests to the banks.</p>
<p>For example, a person in Uganda can <a target="_blank" title="Kiva Loan Request" href="http://kiva.org/app.php?page=businesses&action=about&id=564">request</a> a $500 loan and use it for buying and selling more poultry. One or more lenders (anyone on the Web) may decide to grant loans to that person in increments as tiny as $25. After few years, that person will pay back the loans to the lenders.</p>
<h5>How GeoRSS and Geonames Can Help</h5>
<p>I went to the website and discovered the site has a relative weak search and browsing interface. In particular, there is no way to group loan requests based on geographical locations (e.g., countries, cities and regions).<br />
<a id="more-90"></a>
<br />
Took a look at individual loan pages. Each page actually has standard ways to describe location information – e.g., <strong>Location:</strong> Mbale, Uganda.</p>
<p>It should be relative easy to add <a title="GeoRSS" target="_blank" href="http://www.georss.org/">GeoRSS</a> points (in <a title="Mixing GeoRSS with RDF/A" target="_blank" href="http://www.geospatialsemanticweb.com/2006/06/08/mixing-rdfa-with-georss">the RDF/A syntax</a>) to describe these location information (an alternative maybe using <a title="geocode with microformat" target="_blank" href="http://www.geospatialsemanticweb.com/2006/01/03/how-to-geocode-your-blog">Microformat Geo</a> or <a title="w3c geo" target="_blank" href="http://www.w3.org/2003/01/geo/">W3C Geo</a>). Once the location information is annotated, one can imagine building a map mashup to display loan requests in a geospatial perspective. One can also build search engines to support spatial queries such as “find me all loans with from Mbale”.</p>
<p>Since Kiva.ORG webmasters may not be GIS experts, it will be nice if we can find ways to automatically geocode location information and describe that using GeoRSS. This automatic geocoding procedure can be developed using <a title="geonames webservices" target="_blank" href="http://www.geonames.org/export/geonames-search.html">Geonames’s webservices</a>. Take a string “Mbale” or “Uganda”, and send to Geonames’s search service. The procedure will get back <a target="_blank" title="geonames json saerch" href="http://ws.geonames.org/searchJSON?q=Mbale&maxRows=10">JSON</a> or <a target="_blank" title="geonames xml search" href="http://ws.geonames.org/search?q=Mbale&maxRows=10">XML</a> description of the location, which include latitude and longitude. This will then be used to annotate the location information in a Kiva loan page.</p>
<p>Can you think of other ways to help Kiva.ORG  to become more “geospatially intelligent”?<br />
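<p>As a rough illustration of the geocoding procedure just described, here is a Python sketch. It builds the search URL from the link above and turns an already-parsed JSON response into a GeoRSS Simple point (plain GeoRSS Simple rather than the RDF/A variant, to keep the example short). The function names are hypothetical, the sample response contains made-up coordinates, and the current Geonames service may require extra parameters (such as a user name) that are not shown here.</p>

```python
from urllib.parse import urlencode

# Search URL taken from the links in this post; its availability is an assumption.
GEONAMES_SEARCH = "http://ws.geonames.org/searchJSON"


def search_url(place, max_rows=10):
    """Build the Geonames search URL for a place-name string."""
    return GEONAMES_SEARCH + "?" + urlencode({"q": place, "maxRows": max_rows})


def georss_point(response):
    """Turn the first hit of a parsed searchJSON response into a GeoRSS point.

    Returns None when the response contains no hits.
    """
    hits = response.get("geonames", [])
    if not hits:
        return None
    first = hits[0]
    # GeoRSS Simple writes a point as "latitude longitude", space separated.
    return "<georss:point>{} {}</georss:point>".format(first["lat"], first["lng"])


if __name__ == "__main__":
    # Illustrative response only; the coordinates are placeholders, not looked up.
    sample = {"geonames": [{"name": "Mbale", "countryName": "Uganda",
                            "lat": "1.08", "lng": "34.18"}]}
    print(search_url("Mbale"))
    print(georss_point(sample))
```

<p>A real implementation would fetch the URL, decode the JSON, and decide what to do when several places share a name. The sketch simply takes the first hit.</p>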
You can learn more about <a title="kiva.org" target="_blank" href="http://www.kiva.org">Kiva.ORG</a> at its website and listen to <a title="An eBay for Microfinance" target="_blank" href="http://www.businessweek.com/mediacenter/podcasts/innovation/innovation_07_11_06.htm">this podcast</a>.
</p>&quot;]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-06-01#988">
  <rss:title>Contd: Ajax Database Connectivity Demos</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-06-02T02:48:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Last week I put out a series of screencast style demos that sought to demonstrate the core elements of our soon to be released JavaScript Toolkit called OAT (OpenLink Ajax Toolkit) and its Ajax Database Connectivity layer. The screencasts covered the following functionality realms: SQL Query By Example (basic) SQL Query By Example (advanced - pivot table construction) Web Form Design (basic database driven map based mashup) Web Form Design (advanced database driven map based mashup) To bring additional clarity to the screencast demos and OAT in general, I have saved a number of documents that are the by-products of activities in the screencasts: Live XML Document produced using SQL Query By Example (basic) (you can drag and drop columns across the grid to reorder and sort presentation) Live XML Document produced using QBE and Pivot Functionality (you can drag and drop the aggregate columns and rows to create your own views etc.) Basic database driven map based mashup (works with FireFox, Webkit, Camino; click on pins to see national flag) Advanced database driven map based mashup (works with FireFox, Webkit, Camino; records 36, 87, and 257 will unveil pivots via lookup pin) Notes: “Advanced”, as used above, simply means that I am embedding images (employee photos and national flags) and a database driven pivot into the map pins that serve as details lookups in classic SQL master/details type scenarios. The “Ajax Call In Progress..” dialog is there to show live interaction with a remote database (in this case Virtuoso but this could be any ODBC, JDBC, OLEDB, ADO.NET, or XMLA accessible data source) The data access magic source (if you want to call it that) is XMLA - a standard that has been in place for years but completely misunderstood and as a result underutilized You can see a full collection of saved documents at the following locations: My Mashups demo directory (Google and Yahoo!
demo variants but note these do not work with Safari or IE at the current time. IE7 issues will be resolved in the next day or so) My Pivots demo directory (other Pivots will be added as I build and save them) My Saved Queries (a collection of saved QBE generated queries)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p> Last week I put out a series of screencast style demos that sought to demonstrate the core elements of our soon to be released JavaScript Toolkit called OAT (<a href="http://www.openlinksw.com/oat/">OpenLink Ajax Toolkit</a>) and its Ajax Database Connectivity layer. </p> <p> The screencasts covered the following functionality realms: </p> <ol> <li>   <a href="http://www.openlinksw.com/blog/%7Ekidehen/index.vspx?page=&id=982">SQL Query By Example (basic)</a> </li> <li>   <a href="http://www.openlinksw.com/blog/%7Ekidehen/index.vspx?page=&id=983">SQL Query By Example (advanced - pivot table construction)</a> </li> <li>   <a href="http://www.openlinksw.com/blog/%7Ekidehen/index.vspx?page=&id=981">Web Form Design (basic database driven map based mashup)</a> </li> <li>   <a href="http://www.openlinksw.com/blog/%7Ekidehen/index.vspx?page=&id=985">Web Form Design (advanced database driven map based mashup)</a> </li> </ol> <p> To bring additional clarity to the screencast demos and OAT in general, I have saved a number of documents that are the by-products of activities in the screencasts: </p> <ol> <li>   <a href="http://demo.openlinksw.com/public_demos/queries/customer_qry1.xml">Live XML Document produced using SQL Query By Example (basic)</a> (you can drag and drop columns across the grid to reorder and sort presentation)</li> <li>   <a href="http://demo.openlinksw.com/public_demos/reports/Pivots/employee_sales_by_ship_country_pivot.xml">Live XML Document produced using QBE and Pivot Functionality</a> (you can drag and drop the aggregate columns and rows to create your own views etc.)</li> <li>   <a href="http://demo.openlinksw.com/public_demos/reports/MapMashups/country_flags_google_frm2.xml">Basic database driven map based mashup</a> (works with FireFox, Webkit, Camino; click on pins to see national flag)</li> <li>   <a href="http://demo.openlinksw.com/public_demos/reports/MapMashups/employee_sales_by_ship_country_pivot_google.xml">Advanced database driven 
map based mashup</a> (works with FireFox, Webkit, Camino; records 36, 87, and 257 will unveil pivots via lookup pin)</li> </ol> <p> Notes: </p> <ul> <li>“Advanced”, as used above,  simply means that I am embedding images (employee photos and national flags) and a database driven pivot into the map pins that serve as details lookups in classic SQL master/details type scenarios.</li> <li>The “Ajax Call In Progress..” dialog is there to show live interaction with a remote database (in this case <a href="http://virtuoso.openlinksw.com">Virtuoso</a> but this could be any ODBC, JDBC, OLEDB, ADO.NET, or XMLA accessible data source)</li> <li>The data access magic source (if you want to call it that) is XMLA - a standard that has been in place for years but completely misunderstood and as a result underutilized</li> </ul> <p> You can see a full collection of saved documents at the following locations:   </p> <ul> <li>   <a href="http://demo.openlinksw.com/public_demos/reports/MapMashups/">My Mashups demo directory</a> (Google and Yahoo! demo variants but note these do not work with Safari or IE at the current time. IE7 issues will be resolved in the next day or so) </li> <li>   <a href="http://demo.openlinksw.com/public_demos/reports/Pivots/">My Pivots demo directory</a> (other Pivots will be added as I build and save them) </li> <li>   <a href="http://demo.openlinksw.com/public_demos/queries/">My Saved Queries</a>  (a collection of saved QBE generated queries)</li> </ul>
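<p>For readers who have not met XMLA, mentioned in the notes above, here is a rough Python sketch of what an XMLA <code>Execute</code> request looks like on the wire: a SOAP envelope carrying a statement plus a property list. This is not OAT code. The function name, the data source string, and the table name are placeholders chosen for illustration.</p>

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"     # XML for Analysis


def xmla_execute(statement, data_source, fmt="Tabular"):
    """Build an XMLA Execute request as a SOAP envelope string."""
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    execute = ET.SubElement(body, "{%s}Execute" % XMLA_NS)

    # The statement to run (SQL for a tabular provider) goes under Command.
    command = ET.SubElement(execute, "{%s}Command" % XMLA_NS)
    ET.SubElement(command, "{%s}Statement" % XMLA_NS).text = statement

    # Properties tell the provider which data source to use and how to
    # shape the result (Tabular or Multidimensional).
    props = ET.SubElement(execute, "{%s}Properties" % XMLA_NS)
    plist = ET.SubElement(props, "{%s}PropertyList" % XMLA_NS)
    ET.SubElement(plist, "{%s}DataSourceInfo" % XMLA_NS).text = data_source
    ET.SubElement(plist, "{%s}Format" % XMLA_NS).text = fmt

    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # Placeholder data source and table names, for illustration only.
    print(xmla_execute("SELECT * FROM Customers", "DSN=ExampleDSN"))
```

<p>An Ajax client would POST an envelope like this to the provider&#39;s XMLA endpoint and render the rowset that comes back.</p>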
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#986">
  <rss:title>Screencast: Yahoo! Maps variation of Ajax Database Connectivity Maps Mash-up</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-05-26T22:49:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This is a Yahoo! maps variation of the Google Maps based Forms Designer mash-up screencast.  </dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
  This is a Yahoo! maps variation of the <a href="http://www.openlinksw.com/blog/%7Ekidehen/index.vspx?page=&id=985">Google Maps based Forms Designer mash-up screencast</a>.<br /> <br /> <br />    
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#985">
  <rss:title>Screencast: Building Database Centric Web 2.0 Mash-ups using Ajax Database Connectivity</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-05-26T22:38:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This screencast covers the actual codeless process of building a database centric Web 2.0 mash-up using OAT&#39;s database-aware Forms Designer. This is basically the simplicity of Paradox or Microsoft ACCESS form building delivered via Ajax without any database or operating system lock-in. This demo uses the Google Mapping Service (note: there is a Yahoo! Mapping Service screencast demo that follows this post). Also note the fact that in this demonstration I actually incorporate the Pivot building functionality from an earlier Ajax based Pivot Building screencast.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
     This screencast covers the actual codeless process of building a database centric Web 2.0 mash-up using OAT&#39;s database-aware Forms Designer. This is basically the simplicity of Paradox or Microsoft ACCESS form building delivered via Ajax without any database or operating system lock-in. This demo uses the Google Mapping Service (note: there is a <a href="http://www.openlinksw.com/dataspace/%7Ekidehen/blog/public/Screencasts/oat-formdesigner-mashup-yahoo-maps-demo1.mov">Yahoo! Mapping Service screencast demo</a> that follows this post). Also note the fact that in this demonstration I actually incorporate the Pivot building functionality from an earlier <a href="http://www.openlinksw.com/blog/%7Ekidehen/index.vspx?page=&id=983">Ajax based Pivot Building screencast</a>.<br /> <br />       
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#983">
  <rss:title>Building Pivot Tables using Ajax Database Connectivity</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-05-26T22:08:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This screencast demo (enclosure attached) is a continuation from my earlier Ajax and QBE screencast demo. This time the focus is on building Excel like Pivot tables using data exposed via Ajax Database Connectivity.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
    This screencast demo (enclosure attached) is a continuation from my earlier <a href="http://www.openlinksw.com/blog/%7Ekidehen/index.vspx?page=&id=982">Ajax and QBE screencast</a> demo. This time the focus is on building Excel like Pivot tables using data exposed via Ajax Database Connectivity.<br />      
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-26#982">
  <rss:title>Screencast: Ajax Database Connectivity and SQL Query By Example</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-05-26T21:59:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">AJAX Database Connectivity is the Data Access Component of OAT (OpenLink AJAX Toolkit). It&#39;s basically an XML for Analysis (XMLA) client that enables the development and deployment of database independent Rich Internet Applications (RIAs). Thus, you can now develop database centric AJAX applications without lock-in at the Operating System, Database Connectivity mechanism (ODBC, JDBC, OLEDB, ADO.NET), or back-end Database levels. XMLA has been around for a long time. Its fundamental goal was to provide Web Applications with Tabular and Multi-dimensional data access before it fell off the radar (a story too long to tell in this post). AJAX Database Connectivity only requires your target DBMS to be XMLA (direct), ODBC, JDBC, OLEDB, or ADO.NET accessible. I have attached a Query By Example (QBE) screencast movie enclosure to this post (should you be reading this post Web 1.0 style). The demo shows how Paradox-, Quattro Pro-, Access-, and MS Query-like user friendly querying is achieved using AJAX Database Connectivity.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
  AJAX Database Connectivity is the Data Access Component of OAT (<a href="http://www.openlinksw.com/oat/">OpenLink AJAX Toolkit</a>). It&#39;s basically an <a href="http://www.xmla.org/">XML for Analysis</a> (XMLA) client that enables the development and deployment of database independent Rich Internet Applications (RIAs). Thus, you can now develop database centric AJAX applications without lock-in at the Operating System, Database Connectivity mechanism (ODBC, JDBC, OLEDB, ADO.NET), or back-end Database levels. <br /> <br />XMLA has been around for a long time. Its fundamental goal was to provide Web Applications with Tabular and Multi-dimensional data access before it fell off the radar (a story too long to tell in this post).<br /> <br />AJAX Database Connectivity only requires your target DBMS to be XMLA (direct), ODBC, JDBC, OLEDB, or ADO.NET accessible. <br /> <br />I have attached a Query By Example (QBE) screencast movie enclosure to this post (should you be reading this post Web 1.0 style). The demo shows how Paradox-, Quattro Pro-, Access-, and MS Query-like user friendly querying is achieved using AJAX Database Connectivity.<br /> <br />]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-25#981">
  <rss:title>A Web 2.0 Style Mash-up using the OpenLink Ajax Toolkit (OAT)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-05-25T20:47:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We are now on the verge of finally releasing one of the many items discussed in my recent chat with Jon Udell. The item in question is the OpenLink Ajax Toolkit (OAT) that enables the rapid development of Database Independent Rich Internet Applications. My very first public screencast is deliberately silent (since it&#39;s a live work in progress etc.). The screencast style demo covers the production of a map based mashup that simply unveils the national flag of each country underneath its map marker (a lookup associated with geocoded map pin). This post is also a deliberate test of the automatic production of iPod and Yahoo RSS style syndication gems based on the content of my blog post. Naturally, this is a demonstration of the soon to be unveiled OpenLink Data Spaces technology (the one that supports GData and SPARQL Query Services). BTW - The Data Space that is this blog has been GData aware for a few weeks now (I digress, just watch the movie!): Note: If you are reading this post Web 1.0 style (i.e. via traditional non aggregating browser UI) then click on the &quot;enclosure&quot; link to grab the QuickTime movie file. If on the other hand you are reading via a Web 2.0 aggregator, note that the Podcast Gem should alert you to the existence of the movie enclosure.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
           We are now on the verge of finally releasing one of the many items discussed in my recent <a href="http://www.usnet.private:8889/weblog/kidehen@openlinksw.com/127/index.vspx?page=&id=965&sid=e295397b4a9d07fa9c12baf31569aa97&realm=wa">chat with Jon Udell</a>. The item in question is the OpenLink Ajax Toolkit (OAT) that enables the rapid development of Database Independent Rich Internet Applications. My very first public screencast is deliberately silent (since it&#39;s a live work in progress etc.). <br /> <br />The screencast style demo covers the production of a map based mashup that simply unveils the national flag of each country underneath its map marker (a lookup associated with geocoded map pin).<br /> <br />This post is also a deliberate test of the automatic production of iPod and Yahoo RSS style syndication gems based on the content of my blog post. Naturally, this is a demonstration of the soon to be unveiled OpenLink Data Spaces technology (the one that supports GData and SPARQL Query Services).<br /> <br />BTW - The Data Space that is this blog has been <a href="http://www.openlinksw.com/dataspace/%7Ekidehen/GData">GData</a> aware for a few weeks now (I digress, just watch the movie!):<br /> <br />Note: If you are reading this post Web 1.0 style (i.e. via traditional non aggregating browser UI) then click on the &quot;enclosure&quot; link to grab the QuickTime movie file. If on the other hand you are reading via a Web 2.0 aggregator, note that the Podcast Gem should alert you to the existence of the movie enclosure.<br />             
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-05-15#974">
  <rss:title>Two graphs that explain most IT dysfunction (Part I)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-05-15T16:06:05Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Dumped verbatim below, is a timeless post by Louche Cannon. It is especially poignant in light of the many misguided perceptions about the mutual exclusivity of Web 2.0 and the Semantic Web. Enjoy! Two graphs that explain most IT dysfunction (Part I): &quot; Inspired by reading about other people&#39;s blogging weaknesses, I&#39;ve decided to finally get this one off the back burner and post it. I&#39;m pretty sure that this isn&#39;t original, but I started thinking about this way back in 1996 (pre-social-bookmarking) and I&#39;ve lost my pointer to whatever influenced it. Anybody who can set me straight- I&#39;d appreciate it. So here goes. There are two graphs which, when seen together, explain a hell of a lot about various forms of dysfunction that you see in the technology world. In this first graph, X represents relative &quot;technical expertise&quot; and Y represents the &quot;perceived benefit&quot; in the introduction of a new technology: The summary is that technical neophytes (A) tend to see high potential benefit in new technologies, while people who have a bit of technology experience (B) grow increasingly cynical about technology claims and can rattle-off the names of technologies that they have seen over-hyped and that have under-delivered. The interesting thing though, is that, as people become really expert in technology (C), their view of the potential benefits in new technology starts to increase again. At the far right of this scale I&#39;m talking about the real experts- the alpha-geeks of the world. In the second graph, X again represents technical expertise, but Y represents &quot;perceived risk&quot; associated with the introduction of a new technology: Here the curve is inverted, but the basic pattern is the same. The neophytes (A) are blissfully unaware of the things that can go wrong with the introduction of a new technology. 
The tech-savvy (B) are battle-scarred and have seen (and possibly caused) countless disasters. The alpha-geeks (C) have also seen their share of problems, but they have also learned from their mistakes and know how to avoid them in the future. The alpha-geeks understand how to manage the risk. Now things get interesting when you map these two dynamics against each other: You see that neophytes in group A have essentially the same world view as the alpha-geeks in group C, but for completely different reasons. The trouble starts when you realize that most of senior executives, venture capitalists and members of the popular press are in group A. At the other extreme, most R&amp;D groups, architecture groups, independent consultancies, technology pundits, etc. are in group C . There are a few problems with this: People in group A will often talk to and solicit advice from people in group C There are relatively few people in group C Most of the people who actually have to implement new technologies are in group B. So you can start to see the problem. In Part II I&#39;ll talk some more about group B and I&#39;ll discuss some of the classic patterns that emerge when A, B and C try to work with each other. &quot;</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Dumped verbatim below, is a timeless post by <a href="http://www.breakawayrepublic.com/blog">Louche Cannon</a>. It is especially poignant in light of the many <a href="http://www.oreillynet.com/xml/blog/2006/05/wheres_the_semantic_web_excite.html">misguided perceptions about the mutual exclusivity of Web 2.0 and the Semantic Web</a>. Enjoy!</p>

<blockquote>
<p>
<a href="http://www.breakawayrepublic.com/blog/?p=42#comments">Two graphs that explain most IT dysfunction (Part I)</a>: &quot;</p>
<p>Inspired by reading about other people&#39;s <a href="http://edu-blogger.blogspot.com/2005/05/my-blogging-weakness.html">blogging weaknesses</a>, I&#39;ve decided to finally get this one off the back burner and post it. I&#39;m pretty sure that this isn&#39;t original, but I started thinking about this way back in 1996 (pre-social-bookmarking) and I&#39;ve lost my pointer to whatever influenced it. Anybody who can set me straight- I&#39;d appreciate it.</p>
<p>So here goes.</p>
<p>There are two graphs which, when seen together, explain a hell of a lot about various forms of dysfunction that you see in the technology world.</p>
<p>In this first graph, <strong>X</strong> represents relative &quot;technical expertise&quot; and <strong>Y</strong> represents the &quot;perceived benefit&quot; in the introduction of a new technology:</p>
<p>
 <a href="http://www.breakawayrepublic.com/blog/wp-content/benefit.png" onclick="window.open('http://www.breakawayrepublic.com/blog/wp-content/benefit.png','popup','width=676,height=600,scrollbars=no,resizable=yes,toolbar=no,directories=no,location=no,menubar=no,status=yes,left=0,top=0');return false"><img src="http://www.breakawayrepublic.com/blog/wp-content/benefit-tm.jpg" height="100" width="112" border="1" hspace="4" vspace="4" alt="Benefit" />
 </a>
</p>
<p>The summary is that technical neophytes (A) tend to see high potential benefit in new technologies, while people who have a bit of technology experience (B)  grow increasingly cynical about technology claims and can rattle-off the names of technologies that they have seen over-hyped and that have under-delivered. The interesting thing though, is that, as people become really expert in technology (C), their view of the potential benefits in new technology starts to increase again. At the far right of this scale I&#39;m talking about the real experts- the alpha-geeks of the world.</p>
<p>In the second graph, <strong>X</strong> again represents technical expertise, but <strong>Y</strong> represents &quot;perceived risk&quot; associated with the introduction of a new technology:</p>
<p>
 <a href="http://www.breakawayrepublic.com/blog/wp-content/risk.png" onclick="window.open('http://www.breakawayrepublic.com/blog/wp-content/risk.png','popup','width=676,height=600,scrollbars=no,resizable=yes,toolbar=no,directories=no,location=no,menubar=no,status=yes,left=0,top=0');return false"><img src="http://www.breakawayrepublic.com/blog/wp-content/risk-tm.jpg" height="100" width="112" border="1" hspace="4" vspace="4" alt="Risk" />
 </a>
</p>
<p>Here the curve is inverted, but the basic pattern is the same. The neophytes (A) are blissfully unaware of the things that can go wrong with the introduction of a new technology. The tech-savvy (B) are battle-scarred and have seen (and possibly caused) countless disasters.  The alpha-geeks (C) have also seen their share of problems, but they have also learned from their mistakes and know how to avoid them in the future. The alpha-geeks understand how to manage the risk.</p>
<p>Now things get interesting when you map these two dynamics against each other:</p>
<p>
 <a href="http://www.breakawayrepublic.com/blog/wp-content/benefit_risk.png" onclick="window.open('http://www.breakawayrepublic.com/blog/wp-content/benefit_risk.png','popup','width=676,height=600,scrollbars=no,resizable=yes,toolbar=no,directories=no,location=no,menubar=no,status=yes,left=0,top=0');return false"><img src="http://www.breakawayrepublic.com/blog/wp-content/benefit_risk-tm.jpg" height="100" width="112" border="1" hspace="4" vspace="4" alt="Benefit Risk" />
 </a>
</p>
<p>You see that neophytes in group A have essentially the same world view as the alpha-geeks in group C, but for completely different reasons. The trouble starts when you realize that most of senior executives, venture capitalists and members of the popular press are in group A. At the other extreme, most R&amp;D groups, architecture groups, independent consultancies, technology pundits, etc. are in group C . There are a few problems with this:</p>
<ul>
<li>People in group A will often talk to and solicit advice from people in group C</li>
<li>There are relatively few people in group C</li>
<li>Most of the people who actually have to implement new technologies are in group B.</li>
</ul>
<p>So you can start to see the problem.</p>
<p>In <strong><a href="http://www.breakawayrepublic.com/blog/?p=44">Part II</a></strong> I&#39;ll talk some more about group B and I&#39;ll discuss some of the classic patterns that emerge when A, B and C try to work with each other.
</p>&quot;

</blockquote>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/vdb/blog/?date=2006-04-24#962">
  <rss:title>Virtuoso and Database Scalability</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-04-24T16:06:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso and Database Scalability We have a new technical article, benchmarking Virtuoso on different hardware configurations. This is useful reading for anyone interested in using Virtuoso as a database back end for online applications or simply anyone interested in relational database scalability, no matter what specific DBMS. We use an adaptation of the well known TPC-C benchmark to see what hardware configuration will give the best price/performance. We also explain how to tune Virtuoso and how and why different parameters affect the throughput.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso and Database Scalability</div>
<p>We have a new <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSScale" id="link-id1068c3f8">technical article</a>, benchmarking <a href="http://virtuoso.openlinksw.com" id="link-id0xd413030">Virtuoso</a> on different hardware configurations.</p>
<p>This is useful reading for anyone interested in using Virtuoso as a database back end for online applications or simply anyone interested in relational database scalability, no matter what specific DBMS.</p>
<p>We use an adaptation of the well known <a href="http://dbpedia.org/resource/TPC-C" id="link-id0x1aed8170">TPC-C</a> benchmark to see what hardware configuration will give the best price/performance. We also explain how to tune Virtuoso and how and why different parameters affect the throughput.</p>
</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2006-04-24#961">
  <rss:title>Virtuoso and Database Scalability</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-04-24T15:27:23Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have a new technical article, benchmarking Virtuoso on different hardware configurations. This is useful reading for anyone interested in using Virtuoso as a database back end for online applications or simply anyone interested in relational database scalability, no matter what specific DBMS. We use an adaptation of the well known TPC-C benchmark to see what hardware configuration will give the best price/performance. We also explain how to tune Virtuoso and how and why different parameters affect the throughput.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have a new <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSScale" id="link-id1068c3f8">technical article</a>, benchmarking <a href="http://virtuoso.openlinksw.com" id="link-id0xd32b4d8">Virtuoso</a> on different hardware configurations.</p>
<p>This is useful reading for anyone interested in using Virtuoso as a database back end for online applications or simply anyone interested in relational database scalability, no matter what specific DBMS.</p>
<p>We use an adaptation of the well known <a href="http://dbpedia.org/resource/TPC-C" id="link-id0xdae8e30">TPC-C</a> benchmark to see what hardware configuration will give the best price/performance. We also explain how to tune Virtuoso and how and why different parameters affect the throughput.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/weblog/oerling/?date=2006-04-17#958">
  <rss:title>New Article on XML, Full Text and Smart Alerts</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-04-17T17:07:53Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">There is a new article, XML and Full Text Indexing and Filtering in Virtuoso, on the Virtuoso Open Source Edition wiki. The article shows how to harvest ATOM feeds, search them, and register alerts that fire when a stored search condition is met by incoming data. This lets the new data index the stored queries and not the other way around. This is the first in a series of hands on technical articles on Virtuoso.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>There is a new article, <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSArtText" id="link-id101ebda0">XML and Full Text Indexing and Filtering in Virtuoso</a>, on the <a href="http://virtuoso.openlinksw.com/wiki/main/Main" id="link-id105e7248">Virtuoso Open Source Edition</a> wiki. </p>
<p>The article shows how to harvest ATOM feeds, search them, and register alerts that fire when a stored search condition is met by incoming <a href="http://dbpedia.org/resource/Data" id="link-id0x18b85f38">data</a>. This lets the new data index the stored queries and not the other way around. This is the first in a series of hands on technical articles on <a href="http://virtuoso.openlinksw.com" id="link-id0x1646ca88">Virtuoso</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-04-11#951">
  <rss:title>Virtuoso is Officially Open Source!</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-04-11T18:01:44Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I am pleased to unveil (officially) the fact that Virtuoso is now available in Open Source form. What Is Virtuoso? A powerful next generation server product that implements otherwise distinct server functionality within a single server product. Think of Virtuoso as the server software analog of a dual core processor where each core represents a traditional server functionality realm. Where did it come from? The Virtuoso History page tells the whole story. What Functionality Does It Provide? The following: 1. Object-Relational DBMS Engine (ORDBMS like PostgreSQL and DBMS engine like MySQL) 2. XML Data Management (with support for XQuery, XPath, XSLT, and XML Schema) 3. RDF Triple Store (or Database) that supports SPARQL (Query Language, Transport Protocol, and XML Results Serialization format) 4. Service Oriented Architecture (it combines a BPEL Engine with an ESB) 5. Web Application Server (supports HTTP/WebDAV) 6. NNTP compliant Discussion Server And more. (see: Virtuoso Web Site) 90% of the aforementioned functionality has been available in Virtuoso since 2000 with the RDF Triple Store being the only 2006 item. What Platforms are Supported? The Virtuoso build scripts have been successfully tested on Mac OS X (Universal Binary Target), Linux, FreeBSD, and Solaris (AIX, HP-UX, and Tru64 UNIX will follow soon). A Windows Visual Studio project file is also in the works (ETA some time this week). Why Open Source? Simple, there is no value in a product of this magnitude remaining the &quot;best kept secret&quot;. That status works well for our competitors, but absolutely works against the legions of new generation developers, systems integrators, and knowledge workers that need to be aware of what is actually achievable today with the right server architecture. What Open Source License is it under? GPL version 2. What&#39;s the business model? Dual licensing. 
The Open Source version of Virtuoso includes all of the functionality listed above, while the Virtual Database (distributed heterogeneous join engine) and Replication Engine (across heterogeneous data sources) functionality will only be available in the commercial version. Where is the Project Hosted? On SourceForge. Is there a product Blog? Of course! Up until this point, the Virtuoso Product Blog has been a covert live demonstration of some aspects of Virtuoso (Content Management). My Personal Blog and the Virtuoso Product Blog are actual Virtuoso instances, and have been so since I started blogging in 2003. Is There a product Wiki? Sure! The Virtuoso Product Wiki is also an instance of Virtuoso demonstrating another aspect of the Content Management prowess of Virtuoso. What About Online Documentation? Yep! Virtuoso Online Documentation is hosted via yet another Virtuoso instance. This particular instance also attempts to demonstrate Free Text search combined with the ability to repurpose well formed content in a myriad of forms (Atom, RSS, RDF, OPML, and OCS). What about Tutorials and Demos? The Virtuoso Online Tutorial Site has operated as a live demonstration and tutorial portal for a number of years. During the same timeframe (circa 2001) we also assembled a few Screencast style demos (their look and feel certainly shows its age; updates are in the works). BTW - We have also updated the Virtuoso FAQ and released a number of missing Virtuoso White Papers (amongst many long overdue action items).</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>I am pleased to unveil (officially) the fact that <a href="http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&STORY=/www/story/04-11-2006/0004338324&EDATE=">Virtuoso is now available in Open Source form</a>.</p> <p></p> <h4>What Is Virtuoso?</h4> <p>A powerful next generation server product that implements otherwise distinct server functionality within a single server product. Think of Virtuoso as the server software analog of a dual core processor where each core represents a traditional server functionality realm.</p> <p></p> <h4>Where did it come from?</h4> <p>The <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSHistory">Virtuoso History page</a> tells the whole story.</p> <p></p> <h4>What Functionality Does It Provide?</h4> <p>The following:</p> <ol> <li>Object-Relational DBMS Engine (ORDBMS like PostgreSQL and DBMS engine like MySQL)</li> <li>XML Data Management (with support for XQuery, XPath, XSLT, and XML Schema)</li> <li>RDF Triple Store (or Database) that supports SPARQL (Query Language, Transport Protocol, and XML Results Serialization format)</li> <li>Service Oriented Architecture (it combines a BPEL Engine with an ESB)</li> <li>Web Application Server (supports HTTP/WebDAV)</li> <li>NNTP compliant Discussion Server</li> </ol> <p>And more. (see: <a href="http://virtuoso.openlinksw.com">Virtuoso Web Site</a>)</p> <p> 90% of the aforementioned functionality has been available in Virtuoso since 2000 with the RDF Triple Store being the only 2006 item.</p> <p></p> <h4>What Platforms are Supported?</h4> <p> The Virtuoso build scripts have been successfully tested on Mac OS X (Universal Binary Target), Linux, FreeBSD, and Solaris (AIX, HP-UX, and Tru64 UNIX will follow soon). A Windows Visual Studio project file is also in the works (ETA some time this week).</p> <p></p> <h4>Why Open Source?</h4> <p>Simple, there is no value in a product of this magnitude remaining the &quot;best kept secret&quot;. 
That status works well for our competitors, but absolutely works against the legions of new generation developers, systems integrators, and knowledge workers that need to be aware of what is actually achievable today with the right server architecture.</p> <p></p> <h4>What Open Source License is it under?</h4> <p>GPL version 2.</p> <p></p> <h4>What&#39;s the business model?</h4> <p>Dual licensing.</p> <p>The Open Source version of Virtuoso includes all of the functionality listed above, while the Virtual Database (distributed heterogeneous join engine) and Replication Engine (across heterogeneous data sources) functionality will only be available in the commercial version. </p> <p></p> <h4>Where is the Project Hosted?</h4> <p>On <a href="http://sourceforge.net/projects/virtuoso">SourceForge</a>.</p> <p></p> <h4>Is there a product Blog?</h4> <p>Of course! </p> <p>Up until this point, the <a href="http://virtuoso.openlinksw.com/blog/">Virtuoso Product Blog</a> has been a covert live demonstration of some aspects of Virtuoso (Content Management). My Personal Blog and the Virtuoso Product Blog are actual Virtuoso instances, and have been so since I started blogging in 2003.</p> <p></p> <h4>Is There a product Wiki?</h4> <p>Sure! <a href="http://virtuoso.openlinksw.com/wiki/main/">The Virtuoso Product Wiki</a> is also an instance of Virtuoso demonstrating another aspect of the Content Management prowess of Virtuoso.</p> <p></p> <h4>What About Online Documentation?</h4> <p>Yep! <a href="http://docs.openlinksw.com/virtuoso/">Virtuoso Online Documentation</a> is hosted via yet another Virtuoso instance. 
This particular instance also attempts to demonstrate Free Text search combined with the ability to repurpose well formed content in a myriad of forms (Atom, RSS, RDF, OPML, and OCS).</p> <p></p> <h4>What about Tutorials and Demos?</h4> <p>The <a href="http://demo.openlinksw.com/tutorial/">Virtuoso Online Tutorial</a> Site has operated as a live demonstration and tutorial portal for a number of years. During the same timeframe (circa 2001) we also assembled a few Screencast style demos (their look and feel certainly shows its age; updates are in the works).</p> <p>BTW - We have also updated the <a href="http://virtuoso.openlinksw.com/FAQ/">Virtuoso FAQ</a> and released a number of missing <a href="http://virtuoso.openlinksw.com/Whitepapers/">Virtuoso White Papers</a> (amongst many long overdue action items).</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-03-19#941">
  <rss:title>Getting Closer (Booting solved): WinXP and OSX dual boot in MacBook Pro</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-03-19T22:40:55Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Directly From Nirlog.com:) WinXP and OSX dual boot in MacBook Pro: &quot; Finally I&#39;ve succeeded in installing Windows XP in MacBook Pro. Now it can dual boot between Windows XP and MacOS X. There&#39;re few issues with windows xp but being able to boot smoothly between these 2 OSes are really amazing. I&#39;ve followed this HOWTO where more and more information is being added every few hours. I think most of the minor problems will be solved soon. If you want to install it for your self or want more information this wiki is the best place to go. Here I&#39;m posting the photos of major installation sequence and some problems I encountered. Installation 1. Downloaded winxponmac0.1.zip Windows XP Pro CD that came with my Samsung Notebook is SP1 but the patch works only with SP2. So this is what I did: 2. Downloaded WinXP SP2 separately. 3. Used the free tool nLite to integrate the WinXP SP2 with the XP Pro CD (SP1) and created the WinXP SP2 CD source. 4. Then followed Step-by-step-instruction Burned the customized WinXP CD. Partitioned the disk using OSX CD. Installed OSX. 5. Started Windows XP installation. 6. I encountered a problem with the partition listing. I was presented with following options. C: Partition 1 (EFI) [FAT32] unpartitioned space E: Partition 2 [unknown] unpartitioned space According to the guide the correct option should be as following: E: Partition1 (EFI) [FAT32] C: Partition2 [Unknown] F: Partition3 [Unknown] If you choose the Partition2 then you&#39;ll get following error: 7. To solve the above problem I selected the first &#39;unpartitioned space,&#39; then pressed &#39;C&#39; to create a new partition. As described in this solution. After this things went smoothly. 8. Finally it&#39;s installed 9. System Properties 10. Device Manager with unrecognized devices. 11. Downloaded the drivers from here. Ethernet works fine. Wireless doesn&#39;t work. If I press restart it will shutdown. 12. 
Browsing my blog. 13. Boot Choice: Mac OSX 14. Boot Choice: Windows XP Now there&#39;re few driver issues I&#39;m quite sure they&#39;ll be solved soon.&quot;</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>(Directly From <a href="http://nirlog.com">Nirlog.com</a>:)</p>
<p>
<a href="http://nirlog.com/2006/03/18/winxp-and-osx-dual-boot-in-macbook-pro/#comments">WinXP and OSX dual boot in MacBook Pro</a>: &quot;</p>
<p>
<img height="332" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-start2.gif" />
</p>
<p>Finally I&#39;ve succeeded in installing Windows XP in MacBook Pro. Now it can dual boot between Windows XP and MacOS X. There&#39;re few issues with windows xp but being able to boot smoothly between these 2 OSes are really amazing. I&#39;ve followed this <a href="http://wiki.onmac.net/index.php/HOWTO">HOWTO</a> where more and more information is being added every few hours. I think most of the minor problems will be solved soon. If you want to install it for your self or want more information <a href="http://wiki.onmac.net/index.php/Main_Page">this wiki</a> is the best place to go. Here I&#39;m posting the photos of major installation sequence and some problems I encountered.</p>
<p>
<a id="more-96"></a>
</p>
<p>
<strong>Installation</strong>
</p>
<p>    1. Downloaded <a href="http://download.onmac.net/Winxponmac_0.1.zip">winxponmac0.1.zip</a>
</p>
<p>Windows XP Pro CD that came with my Samsung Notebook is SP1 but the patch works only with SP2. So this is what I did:</p>
<p>    2. Downloaded <a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=049C9DBE-3B8E-4F30-8245-9E368D3CDB5A&displaylang=en">WinXP SP2</a> separately.</p>
<p>    3. Used the free tool <a href="http://www.nliteos.com/nlite.html">nLite</a> to integrate the WinXP SP2 with the XP Pro CD (SP1)  and created the WinXP SP2    CD source.</p>
<p>    4. Then followed <a href="http://wiki.onmac.net/index.php/HOWTO#Step-by-step_Instructions">Step-by-step-instruction</a>
</p>
<ul>
<li>        Burned the customized WinXP CD.</li>
<li>        Partitioned the disk using OSX CD.</li>
<li>        Installed OSX.</li>
</ul>
<p>
<img height="282" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-burn-cd.gif" />
</p>
<p>    5. Started Windows XP installation.</p>
<p>
<img height="356" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-xpinstall.gif" />
</p>
<p>    6. I encountered a problem with the partition listing. I was presented with following options.</p>
<ul>
<li>    C: Partition 1 (EFI) [FAT32]</li>
<li>         unpartitioned space</li>
<li>    E: Partition 2 [unknown]</li>
<li>        unpartitioned space</li>
</ul>
<p>
<img height="341" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-partition-problem.gif" />
</p>
<p>According to the guide the correct option should be as following:</p>
<ul>
<li>E: Partition1 (EFI) [FAT32]</li>
<li>C: Partition2 [Unknown]</li>
<li>F: Partition3 [Unknown]</li>
</ul>
<p>If you choose the Partition2 then you&#39;ll get following error:</p>
<p>
<img height="344" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-partition-problem1.gif" />
</p>
<p>    7. To solve the above problem I selected the first &#39;unpartitioned space,&#39; then pressed &#39;C&#39; to create a new partition. As described in <a href="http://www.macfixit.com/article.php?story=20060317100333451">this solution</a>. After this things went smoothly.</p>
<p>
<img height="360" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-partition-ok.gif" />
</p>
<p>    8. Finally it&#39;s installed</p>
<p>
<img height="332" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-start1.gif" />
</p>
<p>9. System Properties</p>
<p>
<img height="531" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-sys-prop.gif" />
</p>
<p>10. Device Manager with unrecognized devices.</p>
<p>
<img height="399" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-device.gif" />
</p>
<p>11. Downloaded the drivers from <a href="http://wiki.onmac.net/index.php/Drivers">here</a>. Ethernet works fine. Wireless doesn&#39;t work. If I press restart it will shutdown.</p>
<p>12. Browsing my blog.</p>
<p>
<img height="363" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-firefox.gif" />
</p>
<p>13. Boot Choice: Mac OSX</p>
<p>
<img height="360" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-apple.gif" />
</p>
<p>14. Boot Choice: Windows XP</p>
<p>
<img height="360" width="484" style="margin:5px;" alt="" src="http://nirlog.com/wp-content/uploads/2006/03/xponmac-12.gif" />
</p>
<p>Now there&#39;re few driver issues I&#39;m quite sure they&#39;ll be solved soon.</p>&quot;

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2006-02-09#932">
  <rss:title>WINE Arrives for Intel Macs</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2006-02-09T14:29:16Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">WINE Arrives for Intel Macs: &quot; Though the precious dream of dual-booting our Intel Macs has not descended, a convenient alternative has arrived. Although fully functional on developers releases of OS X for Intel, the WINE compatibility layer, which allows Windows programs to run on *nix systems including OS X, was not available for the public release of 10.4.4. However, thanks to the hard work of the folks at Darwine (http://darwine.opendarwin.org/) and their contributors, it appears this barrier has been broken! Find out how to compile WINE and view screenshots in our forum (http://forum.osx86project.org/index.php?showtopic=8699). &quot; (Via The OSx86 Project.)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p><a href="http://www.osx86project.org/index.php?option=com_content&task=view&id=112&Itemid=2">WINE Arrives for Intel Macs</a>: &quot;

Though the precious dream of dual-booting our Intel Macs has
not descended, a convenient alternative has arrived. Although fully functional
on developers releases of OS X for Intel, the WINE compatibility layer, which
allows Windows programs to run on *nix systems including OS X, was not available
for the public release of 10.4.4. However, thanks to the hard work of the folks
at Darwine (http://darwine.opendarwin.org/) and their contributors, it appears this barrier has been broken!
Find out how to compile WINE and view screenshots in our forum (http://forum.osx86project.org/index.php?showtopic=8699).

&quot;</p>

<p>(Via <a href="http://www.osx86project.org">The OSx86 Project</a>.)</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-11-16#904">
  <rss:title>Saving the Net from the pipeholders</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-11-16T18:23:13Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">An interesting post that I have placed verbatim for the following reasons: 1. Its Importance (generally speaking) 2. Lots of Link Love (A-List Blogger Style see: LinkBlog and Summary to see what My Blog does with these links) 3. Time-to-show on Memeorandum (how, when, and if at all, are results that are of personal interest) Anyway, read the post from Doc Searls titled: Saving the Net from the pipeholders &quot;I&#39;ve spent much of the last two weeks writing an essay that just went up at Linux Journal: Saving the Net: How to Keep the Carriers from Flushing the Net Down the Tubes. It&#39;s probably the longest post I&#39;ve ever put up on the Web. It&#39;s certainly the most important. And not just to me. I started writing it after a recent surprise visit by David Isenberg to Santa Barbara. He&#39;s the one who got me - and, I hope, us - going. I finished writing it yesterday after David Berlind published three excellent pieces, which I highly recommend reading, and acting upon. For guidance during the rest of this thing (whether they knew it or not), I also want to thank David Weinberger, Dave Winer, Steve Gillmor, Kevin Werbach, Cory Doctorow, Don Marti, Richard M. Stallman, Eric S. Raymond, Susan Crawford, Larry Lessig, John Palfrey, Chris Nolan, Jeff Jarvis, Craig Burton, Andrew Sullivan, Paul Kunz, Dean Landsman, Matt Welch, Sheila Lennon, George Lakoff, Om Malik, Phil Hughes, J.D. 
Lasica, Virginia Postrel, Chris Anderson, Esther Dyson, Jim Thompson, Micah Sifry, John Perry Barlow, The EFF, the Berkman Center, the Personal Democracy Forum and others I&#39;m overlooking but will fill in later when I have the time. Although it&#39;s kinda huge, Saving the Net wasn&#39;t written as a Finished Work, but rather as a conversation starter - a way to change a rock we&#39;re pushing uphill to a snowball we&#39;re rolling downhill. Larry Lessig started rolling it at OSCON in 2002, and in various other ways before that, and the whole thing has been too damn sisyphean for too damn long. Time to change that. There&#39;s a thesis involved: that the Net is in danger of becoming what Kevin Werbach calls &#39;a private toiled garden for the phone companies&#39;, but that the real enemy is in how we understand the Net itself. We have choices there, and those choices may mean life or death for the Net as most of us have known it - and taken it for granted - for the last decade or more. A couple days ago I spoke to a group of about thirty local citizens here in Santa Barbara County, gathered in the County supervisors&#39; conference room to discuss forming a broadband task force. Early on, I asked people what the Net was. The answers were varied, but had one thing in common: it was a place, and not just fiber and copper.&quot;</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>An interesting post that I have placed verbatim for the following reasons:
</p><ul>1. Its Importance (generally speaking)</ul><ul>2. Lots of Link Love (A-List Blogger Style see: <a href="http://www.openlinksw.com/blog/~kidehen/index.vspx?page=linkblog">LinkBlog</a> and <a href="http://www.openlinksw.com/blog/~kidehen/index.vspx?page=summary">Summary</a> to see what <a href="http://www.openlinksw.com/blog/~kidehen">My Blog</a> does with these links)</ul><ul>3. Time-to-show on <a href="http://memeorandum.com">Memeorandum</a> (how, when, and if at all, are results that are of personal interest)</ul><p>Anyway, read the post from <a href="http://doc.weblogs.com/">Doc Searls</a> titled: <a href="http://doc.weblogs.com/2005/11/16#savingTheNetFromThePipeholders">Saving the Net from the pipeholders</a></p><p>&quot;I&#39;ve spent much of the last two weeks writing an essay that just went up at <a href="http://linuxjournal.com">Linux Journal</a>: <a href="http://www.linuxjournal.com/article/8673">Saving the Net: How to Keep the Carriers from Flushing the Net Down the Tubes</a>. It&#39;s probably the longest post I&#39;ve ever put up on the Web. It&#39;s certainly the most important. And not just to me.</p><p>I started writing it after a recent surprise visit by <a href="http://www.isen.com/blog/">David Isenberg</a> to Santa Barbara. 
He&#39;s the one who got me - and, I hope, us - going.</p><p>I finished writing it yesterday after <a href="http://blogs.zdnet.com/BTL/">David Berlind</a> published <a href="http://blogs.zdnet.com/BTL/?p=2160">three</a> <a href="http://blogs.zdnet.com/BTL/?p=2161">excellent</a> <a href="http://blogs.zdnet.com/BTL/?p=2157">pieces</a>, which I highly recommend reading, and acting upon.</p><p>For guidance during the rest of this thing (whether they knew it or not), I also want to thank <a href="http://www.hyperorg.com/blogger/">David Weinberger</a>, <a href="http://scripting.com/">Dave Winer</a>, <a href="http://blogs.zdnet.com/Gillmor/">Steve Gillmor</a>, <a href="http://werblog.com/">Kevin Werbach</a>, <a href="http://craphound.com/">Cory Doctorow</a>, <a href="http://zgp.org/~dmarti/">Don Marti</a>, <a href="http://www.stallman.org/">Richard M. Stallman</a>, <a href="http://esr.ibiblio.org/">Eric S. Raymond</a>, <a href="http://scrawford.blogware.com/blog">Susan Crawford</a>, <a href="http://lessig.org/blog/">Larry Lessig</a>, <a href="http://blogs.law.harvard.edu/palfrey/">John Palfrey</a>, <a href="http://www.spot-on.com/nolan/">Chris Nolan</a>, <a href="http://www.buzzmachine.com/">Jeff Jarvis</a>, <a href="http://www.craigburton.com/">Craig Burton</a>, <a href="http://www.andrewsullivan.com/">Andrew Sullivan</a>, <a href="http://arstechnica.com/news.ars/post/20011210-2489.html">Paul Kunz</a>, <a href="http://blog.deanland.com/">Dean Landsman</a>, <a href="http://www.mattwelch.com/warblog.html">Matt Welch</a>, <a href="http://www.projo.com/shenews">Sheila Lennon</a>, <a href="http://www.georgelakoff.com/">George Lakoff</a>, <a href="http://gigaom.com/">Om Malik</a>, <a href="http://www.ssc.com/xstatic/corporate/staff/phil.html">Phil Hughes</a>, <a href="http://www.newmediamusings.com/">J.D. 
Lasica</a>, <a href="http://www.dynamist.com/weblog/">Virginia Postrel</a>, <a href="http://longtail.typepad.com/the_long_tail/">Chris Anderson</a>, <a href="http://www.release1-0.com/esther/">Esther Dyson</a>, <a href="http://www.smallworks.com/">Jim Thompson</a>, <a href="http://micah.sifry.com/">Micah Sifry</a>, <a href="http://blog.barlowfriendz.net/">John Perry Barlow</a>, <a href="http://www.eff.org/">The EFF</a>, <a href="http://cyber.law.harvard.edu">the Berkman Center</a>, the <a href="http://www.personaldemocracy.com/">Personal Democracy Forum</a> and others I&#39;m overlooking but will fill in later when I have the time.</p><p>Although it&#39;s kinda huge, <a href="http://www.linuxjournal.com/article/8673">Saving the Net</a> wasn&#39;t written as a Finished Work, but rather as a conversation starter - a way to change a rock we&#39;re pushing uphill to a <a href="http://doc.weblogs.com/2005/03/28#betOnTheSnowball">snowball</a> we&#39;re rolling downhill.</p><p><a href="http://randomfoo.net/oscon/2002/lessig/">Larry Lessig started rolling</a> it at OSCON in 2002, and in various other ways before that, and the whole thing has been too damn <a href="http://en.wikipedia.org/wiki/Sisyphus">sisyphean</a> for too damn long. Time to change that.</p><p>There&#39;s a thesis involved: that the Net is in danger of becoming what <a href="http://werblog.com/">Kevin Werbach</a> <a href="http://werbach.com/blog/archives/2005/11/not_the_interne.html">calls</a> &#39;a private toiled garden for the phone companies&#39;, but that the real enemy is in how we understand the Net itself. We have choices there, and those choices may mean life or death for the Net as most of us have known it - and taken it for granted - for the last decade or more.</p><p>A couple days ago I spoke to a group of about thirty local citizens here in Santa Barbara County, gathered in the County supervisors&#39; conference room to discuss forming a broadband task force. 
Early on, I asked people what the Net was. The answers were varied, but had one thing in common: it was a <i>place</i>, and not just fiber and copper.&quot; </p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-10-28#887">
  <rss:title>Self Annotation of Semantic Web (BBC Demo)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-10-28T22:54:44Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Stop whatever you are doing ...: &quot; .. and go and read Tom Coates&#39; explanation of his last project with the BBC. After 21 years working in broadcasting I reckon this is one of the coolest things to happen for a very, very long time. The ramifications of this will go very deep indeed.&quot; (Spotted Via The Obvious?.) Yes, the ramifications are deep! Tom Coates&#39; screencast demonstrates an internal variation of an activity that is taking place on many fronts (concurrently) across the NET. I tend to refer to this effort as &quot;Self Annotation&quot;; the very process that will ultimately take us straight to &quot;Semantic Web&quot;. It is going to happen much quicker than anticipated because technology is taking the pain out of metadata annotation (e.g. what you do when you tag everything that is ultimately URI accessible). Technology is basically delivering what Jon Udell calls: &quot;reducing the activation threshold&quot;. Using my comments above for context placement, I suggest you take a look at, or re-read Jon Udell&#39;s post titled: Many Meanings of Metadata. Once again, the Web 2.0 brouhaha (in every sense of the word) is a reaction to a critical inflection that ultimately transitions the &quot;Semantic Web&quot; from &quot;Mirage&quot; to &quot;Nirvana&quot;. Put differently (with humor in mind solely!), Web 2.0 is what I tend to call a &quot;John the Baptist&quot; paradigm, and we all know what happened to him :-) Web 2.0 is a conduit to a far more important destination. The tendency to treat Web 2.0 as a destination rather than a conduit has contributed to the recent spate of Bozo bit flipping posts all over the blogosphere (is this an attempt to behead John, metaphorically speaking?). 
Humor aside, a really important thing about the Web 2.0 situation is that when we make the quantum evolutionary leap (internet time, mind you) to the &quot;Semantic Web&quot; (or whatever groovy name we dig up for it in due course) we will certainly have a plethora of reference points (I mean Web 2.0 URIs) ensuring that we do not revisit the &quot;Missing Link&quot; evolutionary paradox :-) BTW - You can see some examples of my contribution to the ongoing annotation process by looking at: My Blog Summary Page, My Linkblog, My Blog Search, My Blog Query Service (click on the enhanced view if you&#39;re a SOAP geek; also note blogid=127)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<blockquote><p><a href="http://theobvious.typepad.com/blog/2005/10/stop_whatever_y.html">Stop whatever you are doing ...</a>: &quot;
</p><div xmlns="http://www.w3.org/1999/xhtml"><p>.. and go and read <a href="http://www.plasticbag.org/archives/2005/10/on_the_bbc_annotatable_audio_project.shtml">Tom Coates&#39; explanation</a> of his last project with the BBC. After 21 years working in broadcasting I reckon this is one of the coolest things to happen for a very, very long time.</p><p>The ramifications of this will go very deep indeed.&quot;</p></div>

<p>(Spotted Via <a href="http://theobvious.typepad.com/blog/">The Obvious?</a>.)</p></blockquote><p> Yes, the ramifications are deep! <a href="http://www.plasticbag.org/">Tom Coates&#39;</a> screencast demonstrates an internal variation of an activity that is taking place on many fronts (concurrently) across the NET. I tend to refer to this effort as &quot;<a href="http://www.openlinksw.com/weblog/kidehen@openlinksw.com/127/index.vspx?page=&id=849">Self Annotation</a>&quot;; the very process that will ultimately take us straight to &quot;<a href="http://www.openlinksw.com/weblog/public/search.vspx?blogid=127&q=#39semantic%20web#39%20&type=text&output=html">Semantic Web</a>&quot;. It is going to happen much quicker than anticipated because technology is taking the pain out of metadata annotation (e.g. what you do when you tag everything that is ultimately URI accessible). Technology is basically delivering what <a href="http://weblog.infoworld.com/udell">Jon Udell</a> calls: <a href="http://weblog.infoworld.com/udell/2004/11/08.html">&quot;reducing the activation threshold&quot;</a>.</p><p>Using my comments above for context placement, I suggest you take a look at, or re-read <a href="http://weblog.infoworld.com/udell/2005/10/27.html#a1330">Jon Udell&#39;s post titled: Many Meanings of Metadata</a>. </p><p>Once again, the Web 2.0 brouhaha (in every sense of the word) is a reaction to a critical inflection that ultimately transitions the &quot;Semantic Web&quot; from &quot;Mirage&quot; to &quot;Nirvana&quot;. Put differently (with humor in mind solely!), Web 2.0 is what I tend to call a &quot;John the Baptist&quot; paradigm, and we all know what happened to him :-)</p><p>Web 2.0 is a conduit to a far more important destination. 
The tendency to treat Web 2.0 as a destination rather than a conduit has contributed to the recent spate of  <a href="http://c2.com/cgi/wiki?SetTheBozoBit">Bozo bit</a> flipping posts all over the blogosphere (is this an attempt to behead John, metaphorically speaking?). Humor aside, a really important thing about the Web 2.0 situation is that when we make the quantum <a href="http://www.pbs.org/wgbh/nova/link/evolution.html">evolutionary leap (internet time, mind you) to the &quot;Semantic Web&quot;</a> (or whatever groovy name we dig up for it in due course) we will certainly have a plethora of reference points (I mean <a href="http://www.openlinksw.com/weblog/public/search.vspx?blogid=127&q=#39web%202.0#39&type=text&output=html">Web 2.0 URIs</a>) ensuring that we do not revisit the &quot;Missing Link&quot; evolutionary paradox :-)</p><p>
BTW - You can see some examples of my contribution to the ongoing annotation process by looking at:
</p><ul><a href="http://www.openlinksw.com/weblog/kidehen@openlinksw.com/127/index.vspx?page=summary">My Blog Summary Page</a></ul><ul><a href="http://www.openlinksw.com/weblog/kidehen@openlinksw.com/127/index.vspx?page=linkblog">My Linkblog</a></ul><ul><a href="http://www.openlinksw.com/weblog/public/search.vspx?blogid=127">My Blog Search</a></ul><ul><a href="http://www.openlinksw.com/BlogAPI/services.vsmx">My Blog Query Service</a> (click on the enhanced view if you&#39;re a SOAP geek; also note blogid=127)</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-09-16#867">
  <rss:title>A Webpage is Not An API or a Platform (The Populicio.us Remix)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-09-16T17:47:38Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A Webpage is Not An API or a Platform (The Populicio.us Remix): &quot; A few months ago in my post GMail Domain Change Exposes Bad Design and Poor Code, I wrote Repeat after me, a web page is not an API or a platform. It seems some people are still learning this lesson the hard way. In the post The danger of running a remix service Richard MacManus writes Populicio.us was a service that used data from social bookmarking site del.icio.us, to create a site with enhanced statistics and a better variety of &#39;popular&#39; links. However the Populicio.us service has just been taken off air, because its developer can no longer get the required information from del.icio.us. The developer of Populicio.us wrote: &#39;Del.icio.us doesn&#39;t serve its homepage as it did and I&#39;m not able to get all needed data to continue Populicio.us. Right now Del.icio.us doesn&#39;t show all the bookmarked links in the homepage so there is no way I can generate real statistics.&#39; This plainly illustrates the danger for remix or mash-up service providers who rely on third party sites for their data. del.icio.us can not only giveth, it can taketh away. It seems Richard Macmanus has missed the point. The issue isn&#39;t depending on a third party site for data. The problem is depending on screen scraping their HTML webpage. An API is a service contract which is unlikely to be broken without warning. A web page can change depending on the whims of the web master or graphic designer behind the site. Versioning APIs is hard enough, let alone trying to figure out how to version an HTML website so screen scrapers are not broken. Web 2.0 isn&#39;t about screenscraping. Turning the Web into an online platform isn&#39;t about legitimizing bad practices from the early days of the Web. Screen scraping needs to die a horrible death. Web APIs and Web feeds are the way of the future. &quot; (Via Dare Obasanjo aka Carnage4Life.) 
Amen!</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<blockquote><p><a href="http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=9e1811b8-f4f9-4407-aff7-92b3cd170f73">A Webpage is Not An API or a Platform (The Populicio.us Remix)</a>: &quot;</p><p>
      A few months ago in my post <a href="http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=87ad1fa6-08a9-491f-90c3-c77b22002c0c">GMail
      Domain Change Exposes Bad Design and Poor Code</a>, I wrote <em>Repeat after me, a
      web page is not an API or a platform</em>. It seems some people are still learning
      this lesson the hard way. In the post <a href="http://www.readwriteweb.com/archives/002829.php">The
      danger of running a remix service</a> Richard MacManus writes 
   </p>
        <blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
          <p>
            <a href="http://populicio.us/">Populicio.us</a> was a service that used data from
      social bookmarking site <a href="http://del.icio.us/">del.icio.us</a>, to create a
      site with enhanced statistics and a better variety of &#39;popular&#39; links. However the
      Populicio.us service has just been taken off air, because its developer can no longer
      get the required information from del.icio.us. <a href="http://populicio.us/">The
      developer of Populicio.us wrote</a>:
   </p>
          <p>
      &#39;Del.icio.us doesn&#39;t serve its homepage as it did and I&#39;m not able to get all needed
      data to continue Populicio.us. Right now Del.icio.us doesn&#39;t show all the bookmarked
      links in the homepage so there is no way I can generate real statistics.&#39;
   </p>
          <p>
      This plainly illustrates the danger for remix or mash-up service providers who rely
      on third party sites for their data. del.icio.us can not only giveth, it can taketh
      away. 
   </p>
        </blockquote>
        <p dir="ltr">
      It seems Richard Macmanus has missed the point. The issue isn&#39;t depending on a third
      party site for data. The problem is depending on screen scraping their HTML webpage.
      An API is a service contract which is unlikely to be broken without warning. A web
      page can change depending on the whims of the web master or graphic designer behind
      the site. 
   </p>
        <p dir="ltr">
      Versioning APIs is hard enough, let alone trying to figure out how to version an HTML
      website so screen scrapers are not broken. Web 2.0 isn&#39;t about screenscraping. Turning
      the Web into an online platform isn&#39;t about legitimizing bad practices from the early
      days of the Web. Screen scraping needs to die a horrible death. Web APIs and Web feeds
      are the way of the future. 
   </p>&quot;

<p>(Via <a href="http://www.25hoursaday.com/weblog/">Dare Obasanjo aka Carnage4Life</a>.)</p></blockquote>

Amen!
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-05-20#849">
  <rss:title>World Wide Web of Junk</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-05-20T23:07:38Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">After digesting Oblique Angle&#39;s post titled: World Wide Web of Junk, it was nice to be reassured that I am not part of a shrinking minority of increasingly perturbed Web users. The post excerpt below is what compelled me to contribute some of my thoughts about the current state of the Web and a future &quot;Semantic Web&quot;. The value of the Internet as a repository of useful information is very low. Carl Shapiro in &quot;Information Rules&quot; suggests that the amount of actually useful information on the Internet would fit within roughly 15,000 books, which is about half the size of an average mall bookstore. To put this in perspective: there are over 5 billion unique, static &amp; publicly accessible web pages on the www. Apparently Only 6% of web sites have educational content (Maureen Henninger, &quot;Don&#39;t just surf the net: Effective research strategies&quot;. UNSW Press). Even of the educational content only a fraction is of significant informational value. Noise is taking over the Web at an alarming rate (to be expected, in a sense), and even though Tim Berners-Lee (TBL) had the foresight to create the Web, many see nothing but futility in his vision for a &quot;Semantic Web&quot; (I don&#39;t!). A recent example of such commentary comes from Eric Nee&#39;s CIO article, titled: Web Future is Not Semantic, Or Overly Orderly. I take issue with this article because, like most (who have been bitten at least once), I don&#39;t like mono culture. This article inadvertently promotes &quot;Google Mono Culture&quot;. I have excerpted the more frustrating parts of this article below: ..As Stanford students, Larry Page and Sergey Brin looked at the same problem - how to impart meaning to all the content on the Web - and decided to take a different approach. 
The two developed sophisticated software that relied on other clues to discover the meaning of content, such as which Web sites the information was linked to. And in 1998 they launched Google.. You mean noise ranking. Now, I don&#39;t think Larry and Sergey set out to do this, but Google page ranks are ultimately based on the concept of &quot;Google Juice&quot; (aka links). The value quotient of this algorithm is accelerating at internet speed (ironically, but naturally). Human beings are smarter than computers, we just process data (not information!) much slower, that&#39;s all. Thus, we can conjure up numerous ways to bubble up the google link ranking algorithms in no time (as is the case today). ..What most differentiates Google&#39;s approach from Berners-Lee&#39;s is that Google doesn&#39;t require people to change the way they post content.. The Semantic Web doesn&#39;t require anyone to change how they post content either! It just provides a roadmap for intelligent content management and consumption through innovative products. ..As Sergey Brin told Infoworld&#39;s 2002 CTO Forum, &quot;I&#39;d rather make progress by having computers understand what humans write, than by forcing humans to write in ways that computers can understand.&quot; In fact, Google has not participated at all in the W3C&#39;s formulation of Semantic Web standards, says Eric Miller.. Semantic Content generated by next generation content managers will make more progress, and they certainly won&#39;t require humans to write any differently. If anything, humans will find the process quite refreshing as and when participation is required e.g. clicking bookmarklets associated with tagging services such as &#39;del.icio.us&#39;, &#39;de.lirio.us&#39;, or Unalog and others. 
But this is only the beginning, if I can click on a bookmarklet to post this blog post to a tagging service, then why wouldn&#39;t I be able to incorporate the &quot;tag service post&quot; into the same process that saves my blog post (the post is content that ends up in a content management system aka blog server)? Yet Google&#39;s impact on the Web is so dramatic that it probably makes more sense to call the next generation of the Web the &quot;Google Web&quot; rather than the &quot;Semantic Web.&quot; Ah! so you think we really want the noisy &quot;Google Web&quot; as opposed to a federation of distributed Information- and Knowledgebases ala the &quot;Semantic Web&quot;? I don&#39;t think so somehow! Today we are generally excited about &quot;tagging&quot; but fail to see its correlation with the &quot;Semantic Web&quot;, somehow? I have said this before, and I will say it again, the &quot;Semantic Web&quot; is going to be self-annotated by humans with the aid of intelligent and unobtrusive annotation technology solutions. These solutions will provide context and purpose by using our social essence as currency. The annotation effort will be subliminal, there won&#39;t be a &quot;Semantic Web Day&quot; parade or anything of the like. It will appear before us all, in all its glory, without any fanfare. Funnily enough, we might not even call it &quot;The Semantic Web&quot;, who cares? But it will have the distinct attributes of being very &quot;Quiet&quot; and highly &quot;Valuable&quot;; with no burden on &quot;how we write&quot;, but constructive burden on &quot;why we write&quot; as part of the content contribution process (less Google/Yahoo/etc juice chasing for more knowledge assembly and exchange). We are social creatures at our core. The Internet and Web have collectively reduced the connectivity hurdles that once made social network oriented solutions implausible. 
The eradication of these hurdles ultimately feeds the very impulses that trigger the critical self-annotation that is the basis of my fundamental belief in the realization of TBL&#39;s Semantic Web vision.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<div align="left">After digesting <a href="http://obliqueangle.blogspot.com/">Oblique Angle</a>&#39;s post titled: <a href="http://obliqueangle.blogspot.com/2005/05/world-wide-web-of-junk.html">World Wide Web of Junk</a>,
it was nice to be reassured that I am not part of a shrinking minority
of increasingly perturbed Web users. The post excerpt below is what
compelled me to contribute some of my thoughts about the current
state of the Web and a future &quot;Semantic Web&quot;.</div> <blockquote style="margin-right: 0px;" dir="ltr"> <div align="left">The value of the Internet as a repository of useful information is very low. <a href="http://faculty.haas.berkeley.edu/shapiro/">Carl Shapiro </a>in <a href="http://www.inforules.com/">&quot;Information Rules&quot;</a>
suggests that the amount of actually useful information on the Internet
would fit within roughly 15,000 books, which is about half the size of
an average mall bookstore. To put this in perspective: there are over 5
billion unique, static &amp; publicly accessible web pages on the www.
Apparently Only 6% of web sites have educational content (Maureen
Henninger, <a href="http://www.mja.com.au/public/bookroom/1999/mullins/mullins.html">&quot;Don&#39;t just surf the net: Effective research strategies&quot;. </a>UNSW Press). Even of the educational content only a fraction is of significant informational value.</div></blockquote> <div dir="ltr" align="left">Noise is taking over the Web at an alarming rate (to be expected, in a sense), and even though <a href="http://www.w3.org/People/Berners-Lee/">Tim Berners-Lee</a>
(TBL) had the foresight to create the Web, many see nothing
but futility in his vision for a &quot;Semantic Web&quot; (I
don&#39;t!). A recent example of such commentary comes from Eric
Nee&#39;s CIO article, titled: <span class="print_article_title"><a href="http://www.cioinsight.com/print_article2/0,2533,a=151806,00.asp">Web Future is Not Semantic, Or Overly Orderly</a>. I take issue with this article because, like most (who have been bitten at least once), I don&#39;t like mono culture</span><span class="print_article_title">. </span>This
article inadvertently promotes &quot;Google Mono Culture&quot;. I
have excerpted the more frustrating parts of this article below:</div> <blockquote style="margin-right: 0px;" dir="ltr"> <div dir="ltr" align="left"> <p><em>..As
Stanford students, Larry Page and Sergey Brin looked at the same
problem - how to impart meaning to all the content on the Web - and decided
to take a different approach. The two developed sophisticated software
that relied on other clues to discover the meaning of content, such as
which Web sites the information was linked to. And in 1998 they
launched Google..</em></p></div></blockquote> <p dir="ltr">You mean
noise ranking. Now, I don&#39;t think Larry and Sergey set out to do this,
but Google page ranks are ultimately based on the concept of &quot;Google
Juice&quot; (aka links). The value quotient of this algorithm is
accelerating at internet speed (ironically, but naturally). Human
beings are smarter than computers, we just process data (not
information!) much slower, that&#39;s all. Thus, we can conjure up
numerous ways to bubble up the google link ranking algorithms in no
time (as is the case today). </p> <blockquote style="margin-right: 0px;" dir="ltr"> <p dir="ltr" align="left"><em>..What
most differentiates Google&#39;s approach from Berners-Lee&#39;s is that Google
doesn&#39;t require people to change the way they post content..</em></p></blockquote> <p dir="ltr" align="left">The
Semantic Web doesn&#39;t require anyone to change how they post content
either! It just provides a roadmap for intelligent content management
and consumption through innovative products. </p><blockquote style="margin-right: 0px;" dir="ltr"> <p dir="ltr" align="left"><em>..As
Sergey Brin told Infoworld&#39;s 2002 CTO Forum, &quot;I&#39;d rather make progress
by having computers understand what humans write, than by forcing
humans to write in ways that computers can understand.&quot; In fact,
Google has not participated at all in the W3C&#39;s formulation of Semantic
Web standards, says Eric Miller.. </em></p></blockquote> <p dir="ltr" align="left">Semantic
Content generated by next generation content managers will make more
progress, and they certainly won&#39;t require humans to write any
differently. If anything, humans will find the process quite refreshing
as and when participation is required e.g. clicking bookmarklets
associated with tagging services such as &#39;<a href="http://del.icio.us">del.icio.us</a>&#39;, <a href="http://de.lirio.us">&#39;de.lirio.us</a>&#39;, or <a href="http://www.unalog.com">Unalog</a>
and others. But this is only the beginning, if I can click on a
bookmarklet to post this blog post to a tagging service, then why
wouldn&#39;t I be able to incorporate the &quot;tag service post&quot; into the same
process that saves my blog post (the post is content that ends up in a <a href="http://virtuoso.openlinksw.com">content management system</a> aka blog server)? </p><blockquote style="margin-right: 0px;" dir="ltr"> <p dir="ltr" align="left"><em>Yet
Google&#39;s impact on the Web is so dramatic that it probably makes more
sense to call the next generation of the Web the &quot;Google Web&quot; rather
than the &quot;Semantic Web.&quot;</em></p></blockquote> <p dir="ltr" align="left">Ah!
so you think we really want the noisy &quot;Google Web&quot; as opposed to a
federation of distributed Information- and Knowledgebases ala the
&quot;Semantic Web&quot;? I don&#39;t think so somehow!</p> <p dir="ltr" align="left">Today
we are generally excited about &quot;tagging&quot; but fail to see its
correlation with the &quot;Semantic Web&quot;, somehow? I have said this before,
and I will say it again, the &quot;Semantic Web&quot; is going to be
self-annotated by humans with the aid of intelligent and
unobtrusive annotation technology solutions. These solutions
will provide context and purpose by using our social
essence as currency. The annotation effort will be subliminal;
there won&#39;t be a &quot;Semantic Web Day&quot; parade or anything of the
like. It will appear before us all, in all its glory, without any
fanfare. Funnily enough, we might not even call it &quot;The
Semantic Web&quot;; who cares? But it will have the distinct attributes of
being very &quot;Quiet&quot; and highly &quot;Valuable&quot;; with no burden
on &quot;how we write&quot;, but a constructive burden on &quot;why
we write&quot; as part of the content contribution process (less
Google/Yahoo/etc. juice chasing for more
knowledge assembly and exchange). </p><p dir="ltr" align="left">We
are social creatures at our core. The Internet and Web have
collectively reduced the connectivity hurdles that once made
social-network-oriented solutions implausible. The eradication
of these hurdles ultimately feeds the very impulses that trigger
the critical self-annotation that is the basis of my fundamental belief
in the realization of TBL&#39;s Semantic Web vision. </p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-05-01#831">
  <rss:title>A Collection of PHP and ODBC How-To Links</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-05-01T15:46:45Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In 2005 I am somewhat surprised at the steady level of emails and commentary expressing confusion about the use of PHP and ODBC. Here are a few links that resolve any confusion about this matter: OpenLink&#39;s PHP and iODBC HOWTO doc: http://www.iodbc.org/index.php?page=languages/php/odbc-phpHOWTO; PHP Everywhere&#39;s guide: http://phplens.com/phpeverywhere/node/view/9; Zili Zhang&#39;s piece from 1999 (time flies!): http://www.tldp.org/HOWTO/MSSQL6-Openlink-PHP-ODBC.html; Zend&#39;s ODBC Tutorial: http://www.zend.com/zend/tut/odbc.php. Or simply google PHP and ODBC, or PHP and iODBC ...</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In 2005 I am somewhat surprised at the steady level of&nbsp;emails and commentary expressing confusion about the use of PHP and ODBC.</p>
<p>Here are a few links that resolve any confusion about this matter:</p>
<ol>
<li>OpenLink's PHP and iODBC HOWTO doc: <a href="http://www.iodbc.org/index.php?page=languages/php/odbc-phpHOWTO">http://www.iodbc.org/index.php?page=languages/php/odbc-phpHOWTO</a><br></li>
<li>PHP Everywhere's guide: <a href="http://phplens.com/phpeverywhere/node/view/9">http://phplens.com/phpeverywhere/node/view/9</a><br></li>
<li>Zili Zhang's piece from 1999 (time flies!): <a href="http://www.tldp.org/HOWTO/MSSQL6-Openlink-PHP-ODBC.html">http://www.tldp.org/HOWTO/MSSQL6-Openlink-PHP-ODBC.html</a><br></li>
<li>Zend's ODBC Tutorial: <a href="http://www.zend.com/zend/tut/odbc.php">http://www.zend.com/zend/tut/odbc.php</a>&nbsp;</li></ol>
<p>Or simply google PHP and ODBC, or PHP and iODBC ...</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-04-29#825">
  <rss:title>Ajax, Hard Facts, Brass Tacks ... and Bad Slacks</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-04-29T20:11:22Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">By Mark Birbeck: Ajax, Hard Facts, Brass Tacks ... and Bad Slacks. A number of people have contacted me recently about Ajax [1] -- a catchy name -- coined to provide an umbrella term for a particular group of technologies used to build web applications. The use of the word comes from Jesse James Garrett in a recent blog [2], and describes a class of internet applications written using JavaScript in a browser. By using JavaScript these apps have full access to the DOM, and as a consequence are able to make all sorts of changes to the page that the user is interacting with, without having to go back to the server. When the application does need to go back to the server -- to deliver some data and get a response -- the idea is to keep the DOM intact so that the user has a smooth experience. This means that all communication with the server needs to take place outside of the normal HTML form mechanism, since this would obviously replace the current page. Ajax addressed this, with what it calls &#39;asynchronous-JavaScript&#39; -- retrieve only the data you need, and then directly manipulate the DOM to get the effect you want. &#39;Asynchronous-JavaScript&#39; accounts for the first few letters of the name, with the remainder being the obligatory &#39;X&#39; for XML (although XML is not really key to this technology, and many of the applications that are often cited as Ajax-apps don&#39;t use XML as the data medium). Buzzing: The response to Ajax has been pretty positive. 
In fact the only negatives have been either to suggest a change of name or to moan a little that &quot;I&#39;ve been doing this for years, why hasn&#39;t anyone noticed me?&quot; (I won&#39;t put any links to those sorts of articles, since they are a little embarrassing -- after all, everyone has been doing this for years!) Anyway, despite a couple of sour-pusses, the software community is almost universally excited, and the blog wires have glowed over the last few months with descriptions of Google Maps, GMail, and so on. Just about everyone who has asked me about Ajax has expected me to be disappointed. Surely, they say, this makes the case for XForms weaker? But my answer is the exact opposite -- XForms and standards-based web applications are in every way superior to the techniques described as Ajax, since the whole raison d&#39;être of XForms and XHTML 2 is to address the very problems that Ajax-like techniques suffer from. That may come across as a little bold...so perhaps I should explain. From Workaround to Feature: We&#39;ve all been using HTML mark-up for years now, and the language hasn&#39;t changed much in that time. As a consequence, the increasing demand for more complex web-pages has meant that the balance in our documents has shifted increasingly from vanilla mark-up to &#39;the workaround&#39;. Whether it&#39;s providing tooltips, dynamic/repeating data sections, or small portions of our page that change without having to request a new document, we&#39;ve generally had to dive into script. But the shift from mark-up to script has meant that the mark-up language itself has been relegated to a mere carrier for programs. Unfortunately this means that no-one gains -- it&#39;s annoying for the programmer to have to produce ever more convoluted spaghetti JavaScript to meet the demands of their audience, but it&#39;s also annoying for the non-programmer, who probably only wants a tooltip. 
And it&#39;s particularly annoying for those who want to use documents on the web for more ambitious applications to find that most of the important stuff in a document is hidden away in script. All is not lost, however, since this collection of &#39;workarounds&#39; provides a rich source of real-life patterns that appear for authors and programmers, time and again. They may be workarounds, but they are much-needed ones. The aim of the new generation of languages like XForms and XHTML 2 is to take these &#39;common patterns&#39; and turn them into mark-up. Just as the HTML elements &lt;a&gt; and &lt;form&gt; pack an enormous amount of functionality into deceptively simple tags, so too can new declarative mark-up capture patterns that have emerged &#39;in the wild&#39;. (Note that this is the opposite of so-called folksonomies, where popular practice that occurs in the wild is left in the wild, and codification is regarded as a dirty word.) The XML HTTP Request Object: Let&#39;s take the much-talked-about XML HTTP Request Object (XMLHttpRequest). If you are not familiar with it, it was originally part of Microsoft&#39;s XML parser, and allows you to send and receive data outside of the normal HTML form processing. Since it&#39;s a handy feature to have in a client, other browsers have followed suit and it&#39;s now becoming the &#39;standard&#39; way to communicate with servers without messing up your page. It&#39;s a corner-stone of Ajax. (A good summary with examples is on Jim Ley&#39;s jibbering.com site [3].) But...we need to be clear that we&#39;re using XMLHttpRequest to get round a weakness in HTML forms. 
The problem we have is that even if you know that a server is about to give you some data, and the server knows it&#39;s about to give you some data, there&#39;s no way to tell your form that -- instead your page will be wiped out and replaced with whatever the server sends back.Of course, constant round-tripping doesn&#39;t make it completely impossible to produce applications, and a lot of books and airline tickets are bought every day without the facility to get &#39;just the data&#39;. But we all know it would reduce network traffic and create a smoother user experience if we could just send a list of books or seats, rather than a whole new page.Over the years applications such as Microsoft&#39;s Outlook Web Access (OWA), have had to step around the HTML form to get just the data they need. But, whilst OWA considerably predates GMail, until the advent of XMLHttpRequest, the techniques used were quite difficult to manage. (Google Suggest is often cited as a good example of an Ajax-app, but interestingly merges old and new techniques; XMLHttpRequest is used to obtain a piece of JavaScript from a server, and this script contains a call to a client-side function, but using server-provided parameters. It&#39;s one of the techniques you might have used in the past with a hidden frame.)So as many have said on their blogs, XMLHttpRequest is not a newly devised technique, but rather a generally accepted replacement for a very old technique. But ultimately that technique is a workaround since the real problem is that HTML forms will always replace the current page. Beyond HTML FormsWhilst XMLHttpRequest gives us a way to get data to and from the server without losing our document, we&#39;ve unfortunately thrown the baby out with the bath-water; whatever the weaknesses of HTML forms, you have to acknowledge that they are pretty simple to use. 
Here&#39;s an abbreviated version of Google&#39;s search form (note that the mark-up is HTML, not XML):&lt;form action=/search name=f&gt; &lt;input type=hidden name=hl value=en&gt; &lt;input maxLength=256 size=55 name=q value=&quot;&quot;&gt; &lt;input type=submit value=&quot;Google Search&quot; name=btnG&gt;&lt;/form&gt; As you can see, the simple problem with HTML forms is that we don&#39;t say anything about where the data should go when we&#39;ve received it from the server. The assumption in HTML of old is that we are just doing a kind of &#39;super-navigation&#39;, and no matter what we send to the server, it will only ever give us back a new web-page. (To put it a different way, you could say that &lt;a&gt; and &lt;form&gt; are pretty much the same thing.)To see how this problem is resolved, let&#39;s code the same Google search in XForms:&lt;xf:submission id=&quot;sub-search&quot; action=&quot;http://www.google.com/complete/search?hl=en&quot; method=&quot;get&quot; separator=&quot;&amp;&quot; replace=&quot;all&quot;/&gt; &lt;xf:input ref=&quot;q&quot;&gt; &lt;xf:label&gt;Query:&lt;/xf:label&gt;&lt;/xf:input&gt; &lt;xf:submit submission=&quot;sub-search&quot;&gt; &lt;xf:label&gt;Google Search&lt;/xf:label&gt;&lt;/xf:submit&gt; Although it will do exactly the same -- right down to replacing the current page -- it&#39;s a little different to the HTML mark-up. But the changes in structure have given us some major benefits, from accessible labels on our form controls, to the possibility of many different submissions for the same data.But what it has also given us is the possibility of solving our data update problem. 
The replace attribute is actually optional in XForms, but I showed it in the previous mark-up so that you can compare it to this:&lt;xf:submission id=&quot;sub-search&quot; action=&quot;http://www.google.com/complete/search?hl=en&quot; method=&quot;get&quot; separator=&quot;&amp;&quot; replace=&quot;instance&quot;/&gt; In this example the data returned from the server will just replace the instance that was sent, and our page will remain completely intact. (The replace attribute can take the values all, instance, or none.)I won&#39;t show the full equivalent using XMLHttpRequest since it&#39;s pretty large, but I&#39;ll give a flavour of it. (Jim Ley&#39;s page -- referenced earlier -- shows how to search Google with XMLHttpRequest.) The Script VersionFirst we need to create an XMLHttpRequest object, but we need to do it in such a way that it will work on both Mozilla and IE:var req; function loadXMLDoc(url) { // native XMLHttpRequest object if (window.XMLHttpRequest) { req = new XMLHttpRequest(); req.onreadystatechange = readyStateChange; req.open(&quot;GET&quot;, url, true); req.send(null); // IE/Windows ActiveX version } else if (window.ActiveXObject) { req = new ActiveXObject(&quot;Microsoft.XMLHTTP&quot;); if (req) { req.onreadystatechange = readyStateChange; req.open(&quot;GET&quot;, url, true); req.send(); } }} When a document is loaded via this function, the readyStateChange() method is invoked:function readyStateChange() { // &#39;4&#39; means document &quot;loaded&quot; if (req.readyState == 4) { // 200 means &quot;OK&quot; if (req.status == 200) { // do something here } else { // error processing here } }} From a programming point of view, I guess you could say that there isn&#39;t a lot wrong with this, but then from a programming point of view there wasn&#39;t a lot wrong with Z80 or 6502 assembly languages -- I just wouldn&#39;t want to go back to them!But the most important issue is that we have lost the very thing that was responsible for HTML&#39;s 
success -- the use of simple, clear, declarative mark-up, in which we simply state our intent, without having to write a program to do it for us. After all, the web took off because authors only had to master &lt;a&gt; in order to enter the exciting new world of &#39;hypertext&#39; -- but XMLHttpRequest raises the bar again, and takes us right back into the heart of geek-world. Beyond XMLHttpRequestBut in keeping with the principle that I outlined above -- that XForms and XHTML 2 try to provide mark-up for commonly existing design patterns -- let&#39;s see if there are any other patterns that XMLHttpRequest has thrown up.You will have noticed in the earlier script that we had tests for success and failure:if (req.status == 200) { // do something here} else { // error processing here} XForms provides the same functionality through the use of events -- on success do this, on failure do that. This is far more powerful, since it hides the protocol-specific aspects of this code (&quot;200&quot; may be &#39;success&#39; for HTTP, but it isn&#39;t &#39;success&#39; when saving data to the hard-drive or sending an email).XForms uses declarative mark-up to express those events, which again dramatically reduces coding:&lt;xf:action ev:observer=&quot;sub-search&quot; ev:event=&quot;xforms-submit-error&quot;&gt; &lt;xf:message level=&quot;modal&quot;&gt; Submission failed &lt;/xf:message&gt;&lt;/xf:action&gt; But there&#39;s lots, lots more in the submission part of XForms: it can provide full XML Schema validation before submitting the data; there is built in support for numerous types of serialisation, such as multipart/related; abstract methods are used so the code is independent of protocol. 
For example, since put means the same thing whether the target URL begins http: or file:, a form with relative paths will run unchanged on a local machine or a web server; it&#39;s extensible -- in formsPlayer 2.0 we have used the submission element to read and write from an ADO database, allowing programmers to convert forms from using the web to using a local database by doing nothing more than changing a single target URL. (Try doing that with XMLHttpRequest!)The submission part of XForms is in fact so powerful that it will eventually form a separate specification, for use in other languages. From Patterns to Mark-upAnd there are plenty more patterns out there that were crying out to be turned into mark-up, and which are now incorporated into XForms and XHTML 2. Do you remember the days when if we wanted a tooltip that contained mark-up -- perhaps an image, or bold text -- we had to use a carefully placed &lt;div&gt;, a CSS display: none;, a mouseover event handler and a timer? Nowadays the programmer with better things to do than work with spaghetti-JavaScript just uses the XForms &lt;hint&gt; element, and for free they get platform independence (and therefore accessibility), as well as the ability to insert any mark-up.And what about the days when we had to write code to open up a text-to-speech engine, and then invoke the various methods on the object to get it to speak its mind? Nowadays who wouldn&#39;t just use a CSS property on their XForms&#39; messages? Bad SlacksAnd do you remember...I&#39;m sorry, this one always makes me laugh...do you remember how we used to write lots of JavaScript to recalculate the shopping-cart when a new item was added? I know it&#39;s hard to believe -- it&#39;s like looking at old photos of us all wearing flares. Anyway, thank God for straight trousers and the XForms dependency-engine. But enough of the good old days, the days of assembly language, C and JavaScript...let&#39;s stick with the new. 
Do Try This at Home: To round all of this off, we&#39;ll take a look at Google Suggest, and we&#39;ll use XForms to implement it. I&#39;ll walk through the demo in a separate blog [4] so that this one doesn&#39;t get too cluttered -- and hopefully by dissecting this simple but useful application, we can show how declarative mark-up scores over scripting. [1] Will AJAX help Google clean up?, c|net, http://news.com.com/Will+AJAX+help+Google+clean+up/2100-1032_3-5621010.html [2] Ajax: A New Approach to Web Applications, Jesse James Garrett, Adaptive Path blog, http://www.adaptivepath.com/publications/essays/archives/000385.php [3] Using the XML HTTP Request object, http://jibbering.com/2002/4/httprequest.html [4] &quot;Google Suggest&quot; Using XForms, http://internet-apps.blogspot.com/2005/04/google-suggest-using-xforms.html Tags: xforms | xbl | webapps | ajax | javascript [via Internet Applications]</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>By <a href="http://internet-apps.blogspot.com/">Mark Birbeck</a>:</p>
<p><a href="http://internet-apps.blogspot.com/2005/04/ajax-hard-facts-brass-tacks-and-bad.html">Ajax, Hard Facts, Brass Tacks ... and Bad Slacks</a> </p>
<div xmlns="http://www.w3.org/1999/xhtml">A number of people have contacted me recently about Ajax [<a href="about:blank#20050426-1">1</a>] -- a catchy name -- coined to provide an umbrella term for a particular group of technologies used to build web applications. The use of the word comes from Jesse James Garrett in a recent blog [<a href="about:blank#20050426-2">2</a>], and describes a class of internet applications written using JavaScript in a browser. By using JavaScript these apps have full access to the DOM, and as a consequence are able to make all sorts of changes to the page that the user is interacting with, without having to go back to the server.<br><br>When the application <em>does</em> need to go back to the server -- to deliver some data and get a response -- the idea is to keep the DOM intact so that the user has a smooth experience. This means that all communication with the server needs to take place outside of the normal HTML form mechanism, since this would obviously replace the current page.<br><br>Ajax addressed this, with what it calls 'asynchronous-JavaScript' -- retrieve only the data you need, and then directly manipulate the DOM to get the effect you want. 'Asynchronous-JavaScript' accounts for the first few letters of the name, with the remainder being the obligatory 'X' for XML (although XML is not really key to this technology, and many of the applications that are often cited as Ajax-apps don't use XML as the data medium).<br><br>
<h2>Buzzing</h2>The response to Ajax has been pretty positive. In fact the only negatives have been either to suggest a change of name or to moan a little that "I've been doing this for years, why hasn't anyone noticed me?" (I won't put any links to those sorts of articles, since they are a little embarrassing -- after all, <em>everyone</em> has been doing this for years!)<br><br>Anyway, despite a couple of sour-pusses, the software community is almost universally excited, and the blog wires have glowed over the last few months with descriptions of Google Maps, GMail, and so on.<br><br>Just about everyone who has asked me about Ajax has expected me to be disappointed. Surely, they say, this makes the case for XForms weaker? But my answer is the exact opposite -- XForms and standards-based web applications are in every way superior to the techniques described as Ajax, since the whole <em>raison d'être</em> of XForms and XHTML 2 is to address the very problems that Ajax-like techniques suffer from.<br><br>That may come across as a little bold...so perhaps I should explain.<br><br>
<h2>From Workaround to Feature</h2>We've all been using HTML mark-up for years now, and the language hasn't changed much in that time. As a consequence, the increasing demand for more complex web-pages has meant that the balance in our documents has shifted increasingly from vanilla mark-up to 'the workaround'. <br><br>Whether it's providing tooltips, dynamic/repeating data sections, or small portions of our page that change without having to request a new document, we've generally had to dive into script. But the shift from mark-up to script has meant that the mark-up language itself has been relegated to a mere carrier for programs.<br><br>Unfortunately this means that no-one gains -- it's annoying for the programmer to have to produce ever more convoluted spaghetti JavaScript to meet the demands of their audience, but it's also annoying for the non-programmer, who probably only wants a tooltip. And it's particularly annoying for those who want to use documents on the web for more ambitious applications to find that most of the important stuff in a document is hidden away in script.<br><br>All is not lost, however, since this collection of 'workarounds' provides a rich source of real-life patterns that appear for authors and programmers, time and again. They may be workarounds, but they are much-needed ones.<br><br>The aim of the new generation of languages like XForms and XHTML 2 is to take these 'common patterns' and turn them into mark-up. Just as the HTML elements <code>&lt;a&gt;</code> and <code>&lt;form&gt;</code> pack an enormous amount of functionality into deceptively simple tags, so too can new declarative mark-up capture patterns that have emerged 'in the wild'.<br><br>(Note that this is the opposite of so-called folksonomies, where popular practice that occurs in the wild is left in the wild, and codification is regarded as a dirty word.)<br><br>
<h2>The XML HTTP Request Object</h2>Let's take the much talked about XML HTTP Request Object (XMLHttpRequest). If you are not familiar with it, it was originally part of Microsoft's XML parser, and allows you to send and receive data outside of the normal HTML form processing. Since it's a handy feature to have in a client, other browsers have followed suit and it's now becoming the 'standard' way to communicate with servers without messing up your page. It's a corner-stone of Ajax. (A good summary with examples is on Jim Ley's jibbering.com site [<a href="about:blank#20050426-3">3</a>].)<br><br>But...we need to be clear that we're using XMLHttpRequest to get round a weakness in HTML forms. The problem we have is that even if you know that a server is about to give you some data, and the <em>server</em> knows it's about to give you some data, there's no way to tell your <em>form</em> that -- instead your page will be wiped out and replaced with whatever the server sends back.<br><br>Of course, constant round-tripping doesn't make it completely impossible to produce applications, and a lot of books and airline tickets are bought every day without the facility to get 'just the data'. But we all know it would reduce network traffic and create a smoother user experience if we could just send a list of books or seats, rather than a whole new page.<br><br>Over the years applications such as Microsoft's <em>Outlook Web Access</em> (OWA), have had to step around the HTML form to get just the data they need. But, whilst OWA considerably predates GMail, until the advent of XMLHttpRequest, the techniques used were quite difficult to manage. (Google Suggest is often cited as a good example of an Ajax-app, but interestingly merges old and new techniques; XMLHttpRequest is used to obtain a piece of JavaScript from a server, and this script contains a call to a client-side function, but using server-provided parameters. 
It's one of the techniques you might have used in the past with a hidden frame.)<br><br>So as many have said on their blogs, XMLHttpRequest is not a newly devised technique, but rather a generally accepted replacement for a very old technique. But ultimately that technique is a workaround since the <em>real</em> problem is that HTML forms will always replace the current page.<br><br><br>
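The Google Suggest pattern described above, where the response is itself a script that calls a client-side function with server-provided parameters, can be sketched roughly as follows. This is only an illustration under stated assumptions, not Google's actual code: the function name <code>showSuggestions</code>, the helper <code>runServerScript</code> and the response text are all hypothetical, and a string constant stands in for the <code>responseText</code> of a real XMLHttpRequest.<br>
<code><pre>
```javascript
// Client-side function that the server-supplied script is expected to call.
// The name showSuggestions and its arguments are hypothetical.
const received = [];
function showSuggestions(query, suggestions) {
  received.push({ query: query, suggestions: suggestions });
}

// Stand-in for the text of an XMLHttpRequest response (req.responseText).
const responseText = 'showSuggestions("xfo", ["xforms", "xforms tutorial"]);';

// Run the returned script with the callback bound to the name it expects.
function runServerScript(scriptText, callbackName, callback) {
  const run = new Function(callbackName, scriptText);
  run(callback);
}

runServerScript(responseText, "showSuggestions", showSuggestions);
console.log(received[0].suggestions.join(", "));
```
</pre></code><br>Running a response as script only makes sense when the server is trusted, since the returned text executes with the same privileges as the rest of the page.<br><br>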
<h2>Beyond HTML Forms</h2>Whilst XMLHttpRequest gives us a way to get data to and from the server without losing our document, we've unfortunately thrown the baby out with the bath-water; whatever the weaknesses of HTML forms, you have to acknowledge that they are pretty simple to use. Here's an abbreviated version of Google's search form (note that the mark-up is HTML, not XML):<br><code><pre><br>&lt;form action=/search name=f&gt;<br>  &lt;input type=hidden name=hl value=en&gt;<br>  &lt;input maxLength=256 size=55 name=q value=""&gt;<br>  &lt;input type=submit value="Google Search" name=btnG&gt;<br>&lt;/form&gt;<br>
</pre></code><br>As you can see, the simple problem with HTML forms is that we don't say anything about where the data should go when we've received it from the server. The assumption in HTML of old is that we are just doing a kind of 'super-navigation', and no matter what we send to the server, it will only ever give us back a new web-page. (To put it a different way, you could say that <code>&lt;a&gt;</code> and <code>&lt;form&gt;</code> are pretty much the same thing.)<br><br>To see how this problem is resolved, let's code the same Google search in XForms:<br><code><pre><br>&lt;xf:submission id="sub-search"<br> action="http://www.google.com/complete/search?hl=en"<br> method="get" separator="&amp;"<br> replace="all"<br>/&gt;<br>
<br>&lt;xf:input ref="q"&gt;<br>  &lt;xf:label&gt;Query:&lt;/xf:label&gt;<br>&lt;/xf:input&gt;<br>
<br>&lt;xf:submit submission="sub-search"&gt;<br>  &lt;xf:label&gt;Google Search&lt;/xf:label&gt;<br>&lt;/xf:submit&gt;<br>
</pre></code><br>Although it will do exactly the same -- right down to replacing the current page -- it's a little different to the HTML mark-up. But the changes in structure have given us some major benefits, from accessible labels on our form controls, to the possibility of many different submissions for the same data.<br><br>But what it has also given us is the possibility of solving our data update problem. The <code>replace</code> attribute is actually optional in XForms, but I showed it in the previous mark-up so that you can compare it to this:<br><code><pre><br>&lt;xf:submission id="sub-search"<br> action="http://www.google.com/complete/search?hl=en"<br> method="get" separator="&amp;"<br> replace="<span style="COLOR: red">instance</span>"<br>/&gt;<br>
</pre></code><br>In this example the data returned from the server will just replace the instance that was sent, and our page will remain completely intact. (The <code>replace</code> attribute can take the values <code>all</code>, <code>instance</code>, or <code>none</code>.)<br><br>I won't show the full equivalent using XMLHttpRequest since it's pretty large, but I'll give a flavour of it. (Jim Ley's page -- referenced earlier -- shows how to search Google with XMLHttpRequest.)<br><br>
<h3>The Script Version</h3>First we need to create an XMLHttpRequest object, but we need to do it in such a way that it will work on both Mozilla and IE:<br><code><pre><br>var req;<br>
<br>function loadXMLDoc(url) {<br>    // native XMLHttpRequest object<br>    if (window.XMLHttpRequest) {<br>        req = new XMLHttpRequest();<br>        req.onreadystatechange = readyStateChange;<br>        req.open("GET", url, true);<br>        req.send(null);<br>    // IE/Windows ActiveX version<br>    } else if (window.ActiveXObject) {<br>        req = new ActiveXObject("Microsoft.XMLHTTP");<br>        if (req) {<br>            req.onreadystatechange = readyStateChange;<br>            req.open("GET", url, true);<br>            req.send();<br>        }<br>    }<br>}<br>
</pre></code><br>When a document is loaded via this function, the <code>readyStateChange()</code> method is invoked:<br><code><pre><br>function readyStateChange() {<br>    // '4' means document "loaded"<br>    if (req.readyState == 4) {<br>        // 200 means "OK"<br>        if (req.status == 200) {<br>            // do something here<br>        } else {<br>            // error processing here<br>        }<br>    }<br>}<br>
</pre></code><br>From a <em>programming</em> point of view, I guess you could say that there isn't a lot wrong with this, but then from a programming point of view there wasn't a lot wrong with Z80 or 6502 assembly languages -- I just wouldn't want to go back to them!<br><br>But the most important issue is that we have lost the very thing that was responsible for HTML's success -- the use of simple, clear, declarative mark-up, in which we simply state our intent, without having to write a program to do it for us. After all, the web took off because authors only had to master <code>&lt;a&gt;</code> in order to enter the exciting new world of 'hypertext' -- but XMLHttpRequest raises the bar again, and takes us right back into the heart of geek-world.<br><br>
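One common way to tame the boilerplate shown above is to fold the create/open/send steps and the <code>readyState</code> check into a single helper that takes success and error callbacks. The sketch below is one possible arrangement, not part of the original article: <code>loadData</code>, <code>createRequest</code> and <code>stubRequest</code> are hypothetical names, and the stub completes synchronously so the flow can be followed without a browser (a real request would complete asynchronously).<br>
<code><pre>
```javascript
// Wrap the create/open/send/readyState boilerplate in one function.
// createRequest is a hypothetical factory; in a browser it could simply
// return new XMLHttpRequest().
function loadData(createRequest, url, onSuccess, onError) {
  const req = createRequest();
  req.onreadystatechange = function () {
    if (req.readyState !== 4) {
      return; // request has not finished yet
    }
    if (req.status === 200) {
      onSuccess(req.responseText);
    } else {
      onError(req.status);
    }
  };
  req.open("GET", url, true);
  req.send(null);
}

// Minimal stub that finishes immediately with a fixed status and body.
function stubRequest(status, body) {
  return function () {
    return {
      readyState: 0,
      status: 0,
      responseText: "",
      onreadystatechange: null,
      open: function () {},
      send: function () {
        this.readyState = 4;
        this.status = status;
        this.responseText = body;
        this.onreadystatechange();
      }
    };
  };
}

const log = [];
const ok = (text) => log.push("ok: " + text);
const failed = (code) => log.push("error: " + code);
loadData(stubRequest(200, "seat list"), "/seats", ok, failed);
loadData(stubRequest(404, ""), "/missing", ok, failed);
console.log(log.join("\n"));
```
</pre></code><br>Even with the wrapper, the caller still writes script and still deals in HTTP status codes, which is the point the following paragraphs make.<br><br>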
<h2>Beyond XMLHttpRequest</h2>But in keeping with the principle that I outlined above -- that XForms and XHTML 2 try to provide mark-up for commonly existing design patterns -- let's see if there are any other patterns that XMLHttpRequest has thrown up.<br><br>You will have noticed in the earlier script that we had tests for success and failure:<br><code><pre><br>if (req.status == 200) {<br>  // do something here<br>} else {<br>  // error processing here<br>}<br>
</pre></code><br>XForms provides the same functionality through the use of events -- on success do this, on failure do that. This is far more powerful, since it hides the protocol-specific aspects of this code ("200" may be 'success' for HTTP, but it isn't 'success' when saving data to the hard-drive or sending an email).<br><br>XForms uses declarative mark-up to express those events, which again dramatically reduces coding:<br><code><pre><br>&lt;xf:action ev:observer="sub-search" ev:event="xforms-submit-error"&gt;<br>  &lt;xf:message level="modal"&gt;<br>    Submission failed<br>  &lt;/xf:message&gt;<br>&lt;/xf:action&gt;<br>
</pre></code><br>But there's lots, lots more in the <code>submission</code> part of XForms:<br>
<ul><br>
<li>it can provide full XML Schema validation before submitting the data;</li><br>
<li>there is built in support for numerous types of serialisation, such as <code>multipart/related</code>;</li><br>
<li>abstract methods are used so the code is independent of protocol. For example, since <code>put</code> means the same thing whether the target URL begins <code>http:</code> or <code>file:</code>, a form with relative paths will run unchanged on a local machine or a web server;</li><br>
<li>it's extensible -- in formsPlayer 2.0 we have used the <code>submission</code> element to read and write from an ADO database, allowing programmers to convert forms from using the web to using a local database by doing nothing more than changing a single target URL. (Try doing that with XMLHttpRequest!)</li><br></ul><br><br>The <code>submission</code> part of XForms is in fact so powerful that it will eventually form a separate specification, for use in other languages.<br><br>
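To make the protocol-independence point above concrete, here is a minimal JavaScript sketch rather than anything taken from XForms or formsPlayer. The names put, handlers and memoryStore are hypothetical, and both targets are simulated in memory. The two event names are borrowed from XForms purely as labels. The idea is only that the caller reacts to success or failure and never inspects a protocol-specific code such as 200.

```javascript
// Hypothetical sketch: put, handlers and memoryStore are illustrative names,
// not part of XForms, formsPlayer or any browser API.
const memoryStore = new Map(); // stands in for a disk, a server or a database

const handlers = {
  // Simulated "file:" target: writing to the in-memory store always succeeds.
  "file:": (url, data) => {
    memoryStore.set(url, data);
    return true;
  },
  // Simulated "http:" target: the status code stays private to this handler.
  "http:": (url, data) => {
    const status = url.includes("/missing") ? 404 : 200;
    if (status === 200) memoryStore.set(url, data);
    return status === 200;
  },
};

// put() means the same thing for every scheme; listeners only see events.
function put(url, data, listeners = {}) {
  const scheme = url.slice(0, url.indexOf(":") + 1);
  const handler = handlers[scheme];
  const ok = handler ? handler(url, data) : false; // unknown scheme: failure
  const eventName = ok ? "xforms-submit-done" : "xforms-submit-error";
  if (listeners[eventName]) listeners[eventName](url);
  return eventName;
}
```

Under these assumptions, switching a form from the simulated web target to the simulated local one means changing only the URL passed to put(); the listeners stay exactly as they were.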
<h2>From Patterns to Mark-up</h2>And there are plenty more patterns out there that were crying out to be turned into mark-up, and which are now incorporated into XForms and XHTML 2. Do you remember the days when if we wanted a tooltip that contained mark-up -- perhaps an image, or bold text -- we had to use a carefully placed <code>&lt;div&gt;</code>, a CSS <code>display: none;</code>, a <code>mouseover</code> event handler and a timer? Nowadays the programmer with better things to do than work with spaghetti-JavaScript just uses the XForms <code>&lt;hint&gt;</code> element, and for free they get platform independence (and therefore accessibility), as well as the ability to insert any mark-up.<br><br>And what about the days when we had to write code to open up a text-to-speech engine, and then invoke the various methods on the object to get it to speak its mind? Nowadays who wouldn't just use a CSS property on their XForms' <code>message</code>s?<br><br>
<h3>Bad Slacks</h3>And do you remember...I'm sorry, this one always makes me laugh...do you remember how we used to write lots of JavaScript to recalculate the shopping-cart when a new item was added? I know it's hard to believe -- it's like looking at old photos of us all wearing flares. Anyway, thank God for straight trousers and the XForms dependency-engine.<br><br><img border="1" src="http://www.npr.org/programs/morning/features/2004/sep/fashion_week/satfever_nano140.jpg"> <br>But enough of the good old days, the days of assembly language, C and JavaScript...let's stick with the new.<br><br>
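For anyone curious what a dependency engine spares us from writing by hand, here is a rough JavaScript sketch of the idea. It is not how any XForms processor is actually implemented; createModel, bind, set and get are hypothetical names, the flat 5-unit shipping charge is invented for the example, and the sketch assumes the dependencies contain no cycles.

```javascript
// Hypothetical sketch of dependency-driven recalculation. All names are
// illustrative; the dependency graph is assumed to be free of cycles.
function createModel() {
  const values = {};   // current value of every named node
  const formulas = {}; // name -> { deps: [names], compute: function }

  // Recompute each formula that depends on the changed name, then
  // propagate, since that formula's own dependents may be stale too.
  function recalculate(changed) {
    for (const [name, formula] of Object.entries(formulas)) {
      if (formula.deps.includes(changed)) {
        values[name] = formula.compute(values);
        recalculate(name);
      }
    }
  }

  return {
    // Store a raw value and refresh everything derived from it.
    set(name, value) {
      values[name] = value;
      recalculate(name);
    },
    // Declare a calculated value once; it is kept up to date afterwards.
    bind(name, deps, compute) {
      formulas[name] = { deps, compute };
      values[name] = compute(values);
      recalculate(name);
    },
    get(name) {
      return values[name];
    },
  };
}

// Usage: the cart total is declared once and never recalculated by hand.
const cart = createModel();
cart.set("items", [{ price: 10, qty: 2 }]);
cart.bind("subtotal", ["items"], (v) =>
  v.items.reduce((sum, item) => sum + item.price * item.qty, 0));
cart.bind("total", ["subtotal"], (v) => v.subtotal + 5); // invented flat shipping
cart.set("items", [{ price: 10, qty: 2 }, { price: 5, qty: 1 }]);
```

After the last line, cart.get("total") reflects the new item without any explicit recalculation call, which is the behaviour the paragraph above credits to the XForms dependency-engine.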
<h2>Do Try This at Home</h2><br>To round all of this off, we'll take a look at Google Suggest, and we'll use XForms to implement it. I'll walk through the demo in a separate blog [<a href="about:blank#20050426-4">4</a>] so that this one doesn't get too cluttered -- and hopefully by dissecting this simple but useful application, we can show how declarative mark-up scores over scripting.<br><br><br><a name="20050426-1">[1] Will AJAX help Google clean up?, c|net, <a href="http://news.com.com/Will+AJAX+help+Google+clean+up/2100-1032_3-5621010.html">http://news.com.com/Will+AJAX+help+Google+clean+up/2100-1032_3-5621010.html</a> <br><br><a name="20050426-2">[2] Ajax: A New Approach to Web Applications, Jesse James Garrett, Adaptive Path blog, <a href="http://www.adaptivepath.com/publications/essays/archives/000385.php">http://www.adaptivepath.com/publications/essays/archives/000385.php</a> <br><br><a name="20050426-3">[3] Using the XML HTTP Request object, <a href="http://jibbering.com/2002/4/httprequest.html">http://jibbering.com/2002/4/httprequest.html</a> <br><br><a name="20050426-4">[4] "Google Suggest" Using XForms, <a href="http://internet-apps.blogspot.com/2005/04/google-suggest-using-xforms.html">http://internet-apps.blogspot.com/2005/04/google-suggest-using-xforms.html</a> <br><br>Tags: <a href="http://technorati.com/tag/xforms" rel="tag">xforms</a> | <a href="http://technorati.com/tag/xbl" rel="tag">xbl</a> | <a href="http://technorati.com/tag/webapps" rel="tag">webapps</a> | <a href="http://technorati.com/tag/ajax" rel="tag">ajax</a> | <a href="http://technorati.com/tag/javascript" rel="tag">javascript</a> </div>
<div align="right">[via <a href="http://internet-apps.blogspot.com/">Internet Applications</a>]</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-03-17#754">
  <rss:title>OpenSearch &amp; Potential Patent Abuse?</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-03-17T22:47:49Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It finally dawned on me what OpenSearch does. Basically you tell it about different search engines by showing it how to query something in each, and get back an RSS return. Then when you search for some term, say foo+bar, it performs the search in all the engines you have configured it for. So it&#39;s a way to group a bunch of search engines together and command them all to look for the same thing. It is clever. It is something that hasn&#39;t been done before, to my knowledge. That&#39;s the good news. The bad news is that Amazon is a leading patent abuser. So as good as this idea is, it&#39;s bad for all the rest of us, unless they tell us that they&#39;re granting us some kind of license to use the idea. [via Scripting News]   I am no fan of Amazon&#39;s moves in the patent arena. At the same time I am very confident that OpenSearch isn&#39;t headed down this path. Virtualization isn&#39;t new or unique (irrespective of context), and the prior art defense should be pretty trivial.   For now, I like what OpenSearch offers, and will continue to do so as long as there is no patent abuse associated with this (I certainly understand Dave Winer&#39;s concern; their track record isn&#39;t great re. this matter).   I should have an OpenSearch variant of this dynamic collection of Amazon and patents related blog posts in the coming days (you will see a new OpenSearch gem alongside RSS/Atom/RDF).   BTW - Here is the dynamic collection of all my Amazon.com posts to date.    </dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<p>It finally dawned on me what <a href="http://www.reallysimplesyndication.com/2005/03/15#a379">OpenSearch</a> does. Basically you tell it about different search engines by showing it how to query something in each, and get back an RSS return. Then when you search for some term, say foo+bar, it performs the search in all the engines you have configured it for. So it's a way to group a bunch of search engines together and command them all to look for the same thing. It is clever. It is something that hasn't been done before, to my knowledge. That's the good news. The bad news is that Amazon is a leading patent abuser. So as good as this idea is, it's bad for all the rest of us, unless they tell us that they're granting us some kind of license to use the idea. [via <a href="http://www.scripting.com/">Scripting News</a>]</p></blockquote>
<div align="right">&nbsp;</div>
<div align="left">I am no fan of Amazon's moves in the patent arena. At the same time I am very confident that OpenSearch isn't headed down this path. Virtualization isn't new or unique (irrespective of context), and the prior art defense should be pretty trivial. </div>
<div align="left">&nbsp;</div>
<div align="left">For now, I like what OpenSearch offers, and will continue to do so as long as there is no&nbsp;patent abuse associated with this (I certainly understand Dave Winer's concern; their track record isn't great re. this matter). </div>
<div align="left">&nbsp;</div>
<div align="left">I should have an OpenSearch variant of this dynamic <a href="http://www.openlinksw.com/blog/search.vspx?blogid=127&q=amazon+patent%0D%0A&type=text&output=html">collection</a> of Amazon and patents&nbsp;related blog posts in the coming days (you will see a new OpenSearch gem alongside RSS/Atom/RDF).</div>
<div align="left">&nbsp;</div>
<div align="left">BTW - <a href="http://www.openlinksw.com/blog/search.vspx?blogid=127&q=amazon%0D%0A&type=text&output=html">Here</a> is the dynamic collection of all my Amazon.com posts to date.</div>
<div align="left">&nbsp;</div>
<div align="left">&nbsp;</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-03-08#746">
  <rss:title>An Interesting Marketing &amp; PR Inflection In Progress</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-03-08T19:50:00Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Wikis, Blogs, and Search Engines are collectively fuelling a huge inflection across the interrelated realms of Technology Marketing and PR. When putting together a post yesterday about &quot;Virtualization&quot;, I instinctively looked to Gurunet&#39;s &quot;answers.com&quot; service for information on the subject: Enterprise Information Integration (EII). Lo and behold! Here is what I found at the tail end of the answers.com article on this subject: This article needs cleanup. This article needs to be edited to conform to a higher standard of article quality. After the article has been cleaned up, you may remove this message. For help, see How to Edit a Page and the style and How-to Directory . Now, I knew this was Wikipedia content repurposed by &quot;answers.com&quot;, and I proceeded to clean up the article. The wikified article took a while to complete, because true to the &quot;Wikipedia&quot; ethos, I had to contribute knowledge as opposed to the original weenie marketing gunk. It&#39;s naturally easier to cut and paste marketing fluff for a misguided quick win attempt than it is to embed links, add knowledge, and discern Wiki Markup (but &quot;Wiki&quot; don&#39;t play that!). This little exercise has broader implications for marketing as a whole, especially for the IT sector. The end of days for &quot;Misinformation based Marketing&quot; is nigh! Wikis, Blogs, Search Engines, Web Services, and Social Networking are rapidly destroying the historically prohibitive costs associated with customer pursuit of facts. I am very confident that product quality will soon overshadow market share as the key determinant of product selection on the part of customers (this is no longer a pipe dream!). I also have increased hope that IT product development and associated product marketing by technology vendors will veer in the same direction.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Wikis, Blogs, and Search Engines are collectively fuelling a huge inflection across the&nbsp;interrelated realms of Technology Marketing and PR.</p>
<p>When putting together a <a href="http://www.openlinksw.com/blog/~kidehen/index.vspx?id=736">post yesterday about "Virtualization"</a>, I instinctively looked to <a href="http://www.gurunet.com/">Gurunet</a>'s "<a href="http://answers.com/">answers.com</a>" service&nbsp;for&nbsp;information on the subject: Enterprise Information Integration (EII). Lo and behold! Here is what I found at the tail end of the answers.com <a href="http://www.answers.com/main/ntquery?s=eii&method=2&gwp=13">article</a> on this subject: </p>
<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<div class="boilerplate metadata" id="cleanup" style="BORDER-RIGHT: rgb(119,153,187) 1px solid; PADDING-RIGHT: 1em; BORDER-TOP: rgb(119,153,187) 1px solid; PADDING-LEFT: 1em; BACKGROUND: rgb(247,251,255) 0% 50%; PADDING-BOTTOM: 0pt; MARGIN: 0.5em 2.5%; BORDER-LEFT: rgb(119,153,187) 1px solid; PADDING-TOP: 0pt; BORDER-BOTTOM: rgb(119,153,187) 1px solid; TEXT-ALIGN: justify; moz-background-clip: initial; moz-background-origin: initial; moz-background-inline-policy: initial">
<p><b>This article needs <a class="extiw" href="http://en.wikipedia.org/wiki/Wikipedia:Cleanup" target="wpext" title="Wikipedia:Cleanup">cleanup</a></b>.<br>This article needs to be edited to conform to a <a class="extiw" href="http://en.wikipedia.org/wiki/Wikipedia:Style_and_How-to_Directory" target="wpext" title="Wikipedia:Style and How-to Directory">higher standard</a> of article quality. After the article has been cleaned up, you may remove this message. For help, see <a class="extiw" href="http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page" target="wpext" title="Wikipedia:How to edit a page">How to Edit a Page</a> and the <a class="extiw" href="http://en.wikipedia.org/wiki/Wikipedia:Style_and_How-to_Directory" target="wpext" title="Wikipedia:Style and How-to Directory">style and How-to Directory</a> <span class="nslink">.</span></p></div></blockquote>
<p>Now, I knew this was <a href="http://en.wikipedia.org/">Wikipedia</a> content repurposed by "answers.com",&nbsp;and I proceeded to clean up the article. The <a href="http://en.wikipedia.org/wiki/EII">wikified article</a> took a while to complete, because true to the "Wikipedia" ethos, I had to contribute knowledge as opposed to the original&nbsp;weenie marketing gunk. It's naturally easier to cut and paste marketing fluff for a misguided quick win attempt than it is to embed links, add&nbsp;knowledge,&nbsp;and&nbsp;discern Wiki Markup (but "Wiki" <a href="http://www.tvtome.com/tvtome/servlet/ShowMainServlet/showid-893/In_Living_Color/">don't&nbsp;play that</a>!).</p>
<p>This little exercise has broader implications for marketing as a whole, especially for the IT sector. The end of days&nbsp;for "Misinformation based Marketing" is nigh! Wikis, Blogs, Search Engines, Web Services, and&nbsp;Social Networking are rapidly destroying the historically prohibitive costs associated with customer pursuit&nbsp;of&nbsp;facts.</p>
<p>I am very confident that product quality will soon overshadow market share as the key determinant of product selection on the part of customers (this is no longer a pipe dream!). I also have increased hope that IT&nbsp;product development and associated product marketing by technology vendors will veer in the same direction. </p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-02-14#687">
  <rss:title>TECH TALK: Multi-Model Minds: Correcting Education</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-02-14T14:56:27Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">TECH TALK: Multi-Model Minds: Correcting Education While each of us can alter and build our own multiple models, it is an uphill struggle once we are past the initial years in educational institutions. We stand at the crossroads in India. We have the advantage of demographics on our side. We need to address the twin challenges of educating India&#39;s youth and doing it right. Education done right can be India&#39;s biggest change agent. Conversely, putting people with limited and incomplete mental models in decision-making positions can worsen the situation dramatically. So, what does it take for us to fix the problem at the source? Atanu Dey wrote about how to re-invent the education system recently on his blog: I think that at a minimum, an educational system must teach people how to think. How to fast and how to wait would be good but perhaps it is too much to ask for right now. Does such a system exist anywhere in the world? I don&#39;t know for sure but I doubt it very sincerely. I realize of course that there are people who have gone through the current educational systems and they are also able to think. But I would be wary of ascribing that result to the present setup. It is more likely that despite the present system, those people have learnt how to think. I believe that learning how to think may be something alike to learning a language. It appears that we have a language learning sub-system in our brains which shuts down sometime around age 12 or so. Before reaching that age, you can very easily learn languages; after that, learning languages is extremely hard. So also, I believe that if you catch a kid early enough, you can teach him or her to think. It is as if the brain circuits are just a lot of firmware in early childhood and then as one grows up, the firmware hardens and become hardware that cannot be re-programmed. Here is my prescription for a good education. 
Focus primarily on teaching how to think and on teaching people how to learn. Teaching how to think is like giving kids a very high powered CPU. Teaching them how to learn gives them control of a very broadband channel through which they can have access to content that the CPU can process. Alternative analogy: good thinking skills is like have a good operating system. And good learning skills is like having a great set of applications. A multi-model mind can be our greatest asset as we seek to build both our careers and the new India around us. But for that to happen, we will have to shed some of the baggage from the past  and that is not going to be easy. We need to make a start with the world inside us, and then the outside. We have the benefit of technological revolutions that are happening around us giving us the ability to compress time -- we don&#39;t have a generation to effect this change. Recommended Reading: My earlier Tech Talk series: My Mental Model Atanu Dey&#39;s Blog Robert Hagstrom, Investing: The Last Liberal Art [via E M E R G I C . o r g]</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<a href="http://www.emergic.org/archives/2005/02/11/index.html#tech_talk_multimodel_minds_correcting_education">TECH TALK: Multi-Model Minds: Correcting Education</a> 
<p>While each of us can alter and build our own multiple models, it is an uphill struggle once we are past the initial years in educational institutions. We stand at the crossroads in India. We have the advantage of demographics on our side. We need to address the twin challenges of educating India's youth and doing it right. Education done right can be India's biggest change agent. Conversely, putting people with limited and incomplete mental models in decision-making positions can worsen the situation dramatically. </p>
<p>So, what does it take for us to fix the problem at the source? <a href="http://www.deeshaa.org/archives/2005/01/10/index.html#006395">Atanu Dey</a> wrote about how to re-invent the education system recently on his blog: <br>
<blockquote><br>I think that at a minimum, an educational system must teach people how to think. How to fast and how to wait would be good but perhaps it is too much to ask for right now. Does such a system exist anywhere in the world? I don't know for sure but I doubt it very sincerely. I realize of course that there are people who have gone through the current educational systems and they are also able to think. But I would be wary of ascribing that result to the present setup. It is more likely that despite the present system, those people have learnt how to think. 
<p></p>
<p>I believe that learning how to think may be something alike to learning a language. It appears that we have a language learning sub-system in our brains which shuts down sometime around age 12 or so. Before reaching that age, you can very easily learn languages; after that, learning languages is extremely hard. So also, I believe that if you catch a kid early enough, you can teach him or her to think. It is as if the brain circuits are just a lot of firmware in early childhood and then as one grows up, the firmware hardens and become hardware that cannot be re-programmed. </p>
<p>Here is my prescription for a good education. Focus primarily on teaching how to think and on teaching people how to learn. Teaching how to think is like giving kids a very high powered CPU. Teaching them how to learn gives them control of a very broadband channel through which they can have access to content that the CPU can process. Alternative analogy: good thinking skills is like have a good operating system. And good learning skills is like having a great set of applications. <br></p></blockquote><br>A multi-model mind can be our greatest asset as we seek to build both our careers and the new India around us. But for that to happen, we will have to shed some of the baggage from the past&nbsp; and that is not going to be easy. We need to make a start with the world inside us, and then the outside. We have the benefit of technological revolutions that are happening around us giving us the ability to compress time -- we don't have a generation to effect this change. 
<p></p>
<p><b>Recommended Reading:</b></p>
<ul>
<li>My earlier Tech Talk series: <a href="http://www.emergic.org/collections/tech_talk_my_mental_model.html">My Mental Model</a></li>
<li><a href="http://www.deeshaa.org/">Atanu Dey's Blog</a></li>
<li>Robert Hagstrom, <a href="http://www.amazon.com/exec/obidos/ASIN/1587991381/emergicorg-20">Investing: The Last Liberal Art</a></li>
</ul>
<div align="right">[via <a href="http://www.emergic.org/">E M E R G I C . o r g</a>]</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-02-12#685">
  <rss:title>Exploring Network Economics</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-02-12T22:00:57Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Exploring Network Economics [via Abhay Bhagat] Michael Mauboussin writes: Economists have successfully described the economics of both information and networks. These economic principles appear durable. It is the combination of information and network properties that creates opportunities for businesses and investors. Most investors have not internalized these ideas. We believe the importance of information-based networks is increasing in today&#39;s global economy for four reasons: 1. Physical capital needs are lower than they were in the past. Information-based networks require less capital as they grow than physical networks do. 2. Networks demonstrate increasing returns. Most industries benefit from supply-side increasing returns to scale: higher volume leads to lower unit costs, up to a point. In contrast, successful networks generate increasing returns from the demand-side as users beget users. 3. Networks can form faster and more frequently than in the past. Because of plummeting communication and computing costs, the barriers to creating a network are declining. But even though the barriers to entry are low, the barriers to success remain high. 4. Networks can spread globally. Because many networks have high upfront costs and low incremental costs, they can expand rapidly within countries and across borders. This report focuses on how to categorize networks, how they affect economic value, and how they form. [via E M E R G I C . o r g]</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<a href="http://www.emergic.org/archives/2004/11/29/index.html#exploring_network_economics">Exploring Network Economics</a> 
<p>[via Abhay Bhagat] <a href="http://www.leggmason.com/funds/ourfunds/whats_new/MaubExplNtwkEcon_1104_Final.pdf">Michael Mauboussin</a> writes:<br>
<blockquote><br>Economists have successfully described the economics of both information and networks. These economic principles appear durable. It is the combination of information and network properties that creates opportunities for businesses and investors. Most investors have not internalized these ideas.
<p></p>
<p>We believe the importance of information-based networks is increasing in today's global economy for four reasons:</p>
<p>1. Physical capital needs are lower than they were in the past. Information-based networks require less capital as they grow than physical networks do.</p>
<p>2. Networks demonstrate increasing returns. Most industries benefit from supply-side increasing returns to scale: higher volume leads to lower unit costs, up to a point. In contrast, successful networks generate increasing returns from the demand-side as users beget users.</p>
<p>3. Networks can form faster and more frequently than in the past. Because of plummeting communication and computing costs, the barriers to creating a network are declining. But even though the barriers to entry are low, the barriers to success remain high.</p>
<p>4. Networks can spread globally. Because many networks have high upfront costs and low incremental costs, they can expand rapidly within countries and across borders.</p>
<p>This report focuses on how to categorize networks, how they affect economic value, and how they form.<br></p></blockquote>
<p></p>
<div align="right">[via <a href="http://www.emergic.org/">E M E R G I C . o r g</a>]</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-02-11#684">
  <rss:title>Avoid Reinventing Wheels: Look Up for XML Schemata and Web Services</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-02-11T22:00:04Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">By Uche Ogbuji, IBM developerWorks The world of XML and Web services is huge, and growing. developerWorks does much to map it out for you, but when you&#39;re looking for a schema or a public Web service to meet some pressing need, it&#39;s useful to have handy several key resources. This tip shows you how to comb through the enormous variety of Internet resources to find schemata and Web services using common search criteria. The best known source for finding public SOAP Web services is XMethods. It has a comprehensive list of SOAP services that you can sort by several criteria. It also provides a demo client so you can try out the services right from the index site. You can also keep track of the listings on XMethods programmatically using UDDI, RSS, and other means. Other sites that provide directories of Web services include RemoteMethods.com and Web Service List. A chronicle of interesting Web services is Web service of the Day. One resource that straddles the Web services/Semantic Web is WSindex.org, a directory of Web services, XML, SOAP, UDDI, WSDL, and Semantic Web resources. This site is a hierarchical and searchable directory. http://www-106.ibm.com/developerworks/xml/library/x-tiplkws.html</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<font size="2">
<p>By Uche Ogbuji, IBM developerWorks</p>
<p>The world of XML and Web services is huge, and growing. developerWorks does much to map it out for you, but when you're looking for a schema or a public Web service to meet some pressing need, it's useful to have handy several key resources. This tip shows you how to comb through the enormous variety of Internet resources to find schemata and Web services using common search criteria. The best known source for finding public SOAP Web services is XMethods. It has a comprehensive list of SOAP services that you can sort by several criteria. It also provides a demo client so you can try out the services right from the index site. You can also keep track of the listings on XMethods programmatically using UDDI, RSS, and other means. Other sites that provide directories of Web services include RemoteMethods.com and Web Service List. A chronicle of interesting Web services is Web service of the Day.</p>
<p>One resource that straddles the Web services/Semantic Web is WSindex.org, a directory of Web services, XML, SOAP, UDDI, WSDL, and Semantic Web resources. This site is a hierarchical and searchable directory. </p>
<p></font><a href="http://www-106.ibm.com/developerworks/xml/library/x-tiplkws.html"><u><font color="#0000ff" size="2">http://www-106.ibm.com/developerworks/xml/library/x-tiplkws.html</font></u></a></p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2005-01-27#668">
  <rss:title>Hacking Open Office</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2005-01-27T14:46:02Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">By Peter Sefton, XML.org The author explores some of the ways that OpenOffice.org&#39;s Writer application is open to customization and configuration. He covers a few techniques that will be of interest to template maintainers working with OpenOffice.org Writer: how to crack open the file format, how to maintain large sets of styles, and how to customize menus and macros, all without using anything except standard tools, zip, an XSLT processor, and a text editor. All this can, of course, be further automated with a programming language of some kind, even a batch file. There are some changes coming in version 2 of OpenOffice.org, but all these techniques will be forwards compatible, although some things like the location and name of the menu-bar files look like they will change. If you are also trying to store and manipulate content in XML but want to use a word processing environment for authoring, then well-crafted templates are even more important. http://www.xml.com/pub/a/2005/01/26/hacking-ooo.html See also the OpenDocument 1.0 CD: http://xml.coverpages.org/ni2005-01-04-a.html</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<font size="2">
<p>By Peter Sefton, XML.org</p>
<p>The author explores some of the ways that OpenOffice.org's Writer application is open to customization and configuration. He covers a few techniques that will be of interest to template maintainers working with OpenOffice.org Writer: how to crack open the file format, how to maintain large sets of styles, and how to customize menus and macros, all without using anything except standard tools, zip, an XSLT processor, and a text editor. All this can, of course, be further automated with a programming language of some kind, even a batch file.</p>
<p>There are some changes coming in version 2 of OpenOffice.org, but all these techniques will be forwards compatible, although some things like the location and name of the menu-bar files look like they will change. If you are also trying to store and manipulate content in XML but want to use a word processing environment for authoring, then well-crafted templates are even more important.</p>
<p></font><a href="http://www.xml.com/pub/a/2005/01/26/hacking-ooo.html"><u><font color="#0000ff" size="2">http://www.xml.com/pub/a/2005/01/26/hacking-ooo.html</font></u></a></p><font size="2">
<p>See also the OpenDocument 1.0 CD: </font><a href="http://xml.coverpages.org/ni2005-01-04-a.html"><u><font color="#0000ff" size="2">http://xml.coverpages.org/ni2005-01-04-a.html</font></u></a></p><font size="2"></font>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-12-20#651">
  <rss:title>How Blogs Can Supercharge Your Business</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-12-20T21:13:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">How Blogs Can Supercharge Your Business How Blogs Can Supercharge Your Business Blogs in business is a new idea and a strategy that is evolving very quickly. There many new products and services emerging to service everyone interesting in blogs and blogging. While there are new opportunities being created because of blogs, small and large businesses as well as solo entrepreneurs are deeply interested how blogs can help their bottom line as well. Blogs were once thought to be the domain of computer geeks and teens interested in publishing their personal ideas, poetry, commentary and the like. Blogs are no longer &quot;personal&quot; publishing tools anymore. Blogs are now business publishing tools and powerful ones at that. You learned in Lesson One that blogs allow you publish quickly, easily, efficiently and instantly. This has transformed business communications online. Now all companies large and small can create a very low cost content communication and publishing strategy online. You see, blogs don&#39;t have the same overhead as a large scale information site or e-commerce site. Blogs are not complex content management systems that require a paid staff of webmasters and programmers to support. Blogs are low as $14.95 a month to use, require only one person the manage easily and yet the content of blog will outperform the content of a more established site both in speed of distribution and in search engine rankings. Businesses and entrepreneurs benefit in 3 BIG ways from a blog. Information. Blogs allow you to distributed information instantly and frequently. Speed of communications is critical in this day and age. You need to communicate information about your products and services quickly. Blogs allow you to educate your markets and engage in real-time two-way conversations with customers and prospects around topics that relate to you and them. 
As an information tool for businesses and entrepreneurs, blogs allow you to build recurring relationships with prospects and customers by establishing rapport. As you publish content that your customers and prospects come to know and trust, they will return to you as their expert and vote with their wallets. Reputation. Blogs build reputation. Blogs are considered honest communication tools. Blogs are two-way communication tools, and people have come to expect blogs to provide high-value, useful content and honest, transparent communications. Your level of integrity will weigh heavily on how you build authority and credibility regarding your blog&#39;s subject matter. What you publish will serve as the basis for others to formulate an opinion about your expertise, knowledge and character in general. Communication. Blogs allow you to communicate at the speed of business. You can literally communicate relevant information to partners, personnel and prospects in real time, as events occur in your business. Literally. How powerful is that?! What greater way to maintain a competitive advantage than to alert your customers to new and useful information about your business? This keeps customers engaged in conversations with you and less susceptible to the influence of competitors&#39; marketing messages. Information, reputation and communication are the three things every business must build in order to succeed online. A business blog allows you to build a bigger information depot that can draw a continuous flow of visitors looking for your content. A business blog allows you to build reputation by publishing high-value information about and around your industry, your products and services. As you frequently demonstrate your expertise, experience and know-how, you can gain recognition in your market niches. If you aren&#39;t using a blog in your business and marketing endeavors, you are surely being left behind. 
I can almost guarantee you that one or more of your competitors is using a blog to communicate and distribute information on products and services that compete with yours. If you aren&#39;t using a blog but you are using a regular web site, your competition is going to whoop you in the search engine rankings game. For any online business, getting your content indexed and found by your target audience is a bottom-line activity. Blogs have changed the game of search engine optimization; in an upcoming article I will talk about the SEO benefits of blogs. To recap, a blog is a great business accelerator. An acceleration in communications can translate into an acceleration in business and entrepreneurial processes that impact your bottom line. Blogs allow businesses and entrepreneurs to share information instantly and frequently. Blogs allow businesses to build reputation by demonstrating subject matter expertise, and finally, blogs allow businesses to communicate in real time as their business happens. &quot;If You Want To Learn How To Blog For Fun and Profits... Then CLICK HERE NOW and I&#39;ll Show You Step-By-Step!&quot;</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<a href="http://blogforfunandprofit.blogware.com/blog/_archives/2004/12/11/202992.html">How Blogs Can Supercharge Your Business</a> <br /><br />Blogs in business is a new idea and a strategy that is evolving very quickly. There are many new products and services emerging to serve everyone interested in blogs and blogging. While new opportunities are being created because of blogs, small and large businesses as well as solo entrepreneurs are deeply interested in how blogs can help their bottom line as well. Blogs were once thought to be the domain of computer geeks and teens interested in publishing their personal ideas, poetry, commentary and the like. Blogs are no longer just &quot;personal&quot; publishing tools. <br /><br />Blogs are now business publishing tools, and powerful ones at that. You learned in Lesson One that blogs allow you to publish quickly, easily, efficiently and instantly. This has transformed business communications online. Now all companies, large and small, can create a very low-cost content communication and publishing strategy online. <br /><br />You see, blogs don&#39;t have the same overhead as a large-scale information site or e-commerce site. Blogs are not complex content management systems that require a paid staff of webmasters and programmers to support. <br /><br />Blogs cost as little as $14.95 a month to use and require only one person to manage, yet the content of a blog will outperform the content of a more established site both in speed of distribution and in search engine rankings. <br /><br />Businesses and entrepreneurs benefit in 3 BIG ways from a blog. <br /><br />Information. <br /><br />Blogs allow you to distribute information instantly and frequently. Speed of communication is critical in this day and age. You need to communicate information about your products and services quickly. 
Blogs allow you to educate your markets and engage in real-time two-way conversations with customers and prospects around topics that relate to you and them. <br /><br />As an information tool for businesses and entrepreneurs, blogs allow you to build recurring relationships with prospects and customers by establishing rapport. As you publish content that your customers and prospects come to know and trust, they will return to you as their expert and vote with their wallets. <br /><br />Reputation. <br /><br />Blogs build reputation. Blogs are considered honest communication tools. Blogs are two-way communication tools, and people have come to expect blogs to provide high-value, useful content and honest, transparent communications. <br /><br />Your level of integrity will weigh heavily on how you build authority and credibility regarding your blog&#39;s subject matter. What you publish will serve as the basis for others to formulate an opinion about your expertise, knowledge and character in general. <br /><br />Communication. <br /><br />Blogs allow you to communicate at the speed of business. You can literally communicate relevant information to partners, personnel and prospects in real time, as events occur in your business. Literally. How powerful is that?! <br /><br />What greater way to maintain a competitive advantage than to alert your customers to new and useful information about your business? This keeps customers engaged in conversations with you and less susceptible to the influence of competitors&#39; marketing messages. <br /><br /><br />Information, reputation and communication are the three things every business must build in order to succeed online. <br /><br />A business blog allows you to build a bigger information depot that can draw a continuous flow of visitors looking for your content. <br /><br />A business blog allows you to build reputation by publishing high-value information about and around your industry, your products and services. 
As you frequently demonstrate your expertise, experience and know-how, you can gain recognition in your market niches. <br /><br />If you aren&#39;t using a blog in your business and marketing endeavors, you are surely being left behind. I can almost guarantee you that one or more of your competitors is using a blog to communicate and distribute information on products and services that compete with yours. <br /><br />If you aren&#39;t using a blog but you are using a regular web site, your competition is going to whoop you in the search engine rankings game. For any online business, getting your content indexed and found by your target audience is a bottom-line activity. Blogs have changed the game of search engine optimization; in an upcoming article I will talk about the SEO benefits of blogs. <br /><br />To recap, a blog is a great business accelerator. An acceleration in communications can translate into an acceleration in business and entrepreneurial processes that impact your bottom line. Blogs allow businesses and entrepreneurs to share information instantly and frequently. Blogs allow businesses to build reputation by demonstrating subject matter expertise, and finally, blogs allow businesses to communicate in real time as their business happens. <br /><br /><br />
<center><font face="Arial" size="2"><b>&quot;If You Want To Learn How To Blog For Fun and Profits...</b></font> <br /><font face="Arial" size="2">Then <span style="BACKGROUND-COLOR: #ffff00"><a href="http://www.how2blog.com/at/go.php?c=h2b&s=BFFPBG" target="_blank">CLICK HERE NOW</a></span> and I&#39;ll <u>Show You Step-By-Step</u>!&quot;</font></center><br />]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-09#559">
  <rss:title>Questions about Longhorn, part 3: Avalon&#39;s enterprise mission</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-06-09T21:48:25Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A Blog post for the ages, from Jon Udell. I expect to refer back to this post a number of times in the future, as I have the same concerns across related realms; for instance data access API usage and evolution. Enjoy! Questions about Longhorn, part 3: Avalon&#39;s enterprise mission The slide shown at the right comes from a presentation entitled Windows client roadmap, given last month to the International .NET Association (INETA). When I see slides like this, I always want to change the word &quot;How&quot; to &quot;Why&quot; -- so, in this case, the question would become &quot;Why do I have to pick between Windows Forms and Avalon?&quot; Similarly, MSDN&#39;s Channel 9 ran a video clip of Joe Beda, from the Avalon team, entitled How should developers prepare for Longhorn/Avalon? that, at least for me, begs the question &quot;Why should developers prepare for Longhorn/Avalon?&quot; I&#39;ve been looking at decision trees like the one shown in this slide for more than a decade. It&#39;s always the same yellow-on-blue PowerPoint template, and always the same message: here&#39;s how to manage your investment in current Windows technologies while preparing to assimilate the new stuff. For platform junkies, the internal logic can be compelling. The INETA presentation shows, for example, how it&#39;ll be possible to use XAML to write WinForms apps that host combinations of WinForms and Avalon components, or to write Avalon apps that host either or both style of component. Cool! But...huh? Listen to how Joe Beda frames the &quot;rich vs. reach&quot; debate: Avalon will be supplanting WinForms, but WinForms is more reach than it is rich. It&#39;s the reach versus rich thing, and in some ways there&#39;s a spectrum. If you write an ASP.NET thing and deploy via the browser, that&#39;s really reach. If you write a WinForms app, you can go down to Win98, I believe. Avalon&#39;s going to be Longhorn only. 
So developers are invited to classify degrees of reach -- not only with respect to the Web, but even within Windows -- and to code accordingly. What&#39;s more, they&#39;re invited to consider WinForms, the post-MFC (Microsoft Foundation Classes) GUI framework in the .NET Framework, as &quot;reachier&quot; than Avalon. That&#39;s true by definition since Avalon&#39;s not here yet, but bizarre given that mainstream Windows developers can&#39;t yet regard .NET as a ubiquitous foundation, even though many would like to. Beda recommends that developers isolate business logic and data-intensive stuff from the visual stuff -- which is always smart, of course -- and goes on to sketch an incremental plan for retrofitting Avalon goodness into existing apps. He concludes: Avalon, and Longhorn in general, is Microsoft&#39;s stake in the ground, saying that we believe power on your desktop, locally sitting there doing cool stuff, is here to stay. We&#39;re investing on the desktop, we think it&#39;s a good place to be, and we hope we&#39;re going to start a wave of excitement leveraging all these new technologies that we&#39;re building. It&#39;s not every decade that the Windows presentation subsystem gets a complete overhaul. As a matter of fact, it&#39;s never happened before. Avalon will retire the hodge-podge of DLLs that began with 16-bit Windows, and were carried forward (with accretion) to XP and Server 2003. It will replace this whole edifice with a new one that aims to unify three formerly distinct modes: the document, the user interface, and audio-visual media. This is a great idea, and it&#39;s a big deal. If you&#39;re a developer writing a Windows application that needs to deliver maximum consumer appeal three or four years from now, this is a wave you won&#39;t want to miss. 
But if you&#39;re an enterprise that will have to buy or build such applications, deploy them, and manage them, you&#39;ll want to know things like: How much fragmentation can my developers and users tolerate within the Windows platform, never mind across platforms? Will I be able to remote the Avalon GUI using Terminal Services and Citrix? Is there any way to invest in Avalon without stealing resources from the Web and mobile stuff that I still have to support? Then again, why even bother to ask these questions? It&#39;s not enough to believe that the return of rich-client technology will deliver compelling business benefits. (Which, by the way, I think it will.) You&#39;d also have to be shown that Microsoft&#39;s brand of rich-client technology will trump all the platform-neutral variations. Perhaps such a case can be made, but the concept demos shown so far don&#39;t do so convincingly. The Amazon demo at the Longhorn PDC (Professional Developers Conference) was indeed cool, but you can see similar stuff happening in Laszlo, Flex, and other RIA (rich Internet application) environments today. Not, admittedly, with the same 3D effects. But if enterprises are going to head down a path that entails more Windows lock-in, Microsoft will have to combat the perception that the 3D stuff is gratuitous eye candy, and show order-of-magnitude improvements in users&#39; ability to absorb and interact with information-rich services. [via Jon&#39;s Radio]</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>A Blog post for the ages, from <a href="http://weblog.infoworld.com/udell">Jon Udell</a>. I expect to refer back to this post a number of times in the future, as I have the same concerns across related realms; for instance data access API usage and evolution.</p>
<p>Enjoy!</p>
<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<p><a href="http://weblog.infoworld.com/udell/2004/06/09.html#a1019">Questions about Longhorn, part 3: Avalon&#39;s enterprise mission</a> </p>
<p><a href="http://weblog.infoworld.com/udell/gems/WinformsVsAvalon.jpg"><img align="right" hspace="6" src="http://weblog.infoworld.com/udell/gems/WinformsVsAvalon_s.jpg" vspace="6" /></a> The slide shown at the right comes from a presentation entitled <a href="http://www.ineta.org/DesktopDefault.aspx?tabindex=2&tabid=41&FileID=125">Windows client roadmap</a>, given last month to the International .NET Association (<a href="http://www.ineta.org/DesktopDefault.aspx">INETA</a>). When I see slides like this, I always want to change the word &quot;How&quot; to &quot;Why&quot; -- so, in this case, the question would become &quot;Why do I have to pick between Windows Forms and Avalon?&quot; Similarly, MSDN&#39;s Channel 9 ran a video clip of Joe Beda, from the Avalon team, entitled <a href="http://www.microsoft.com/winme/0404/22606/Joe_Beda_prepare_300k.asx">How should developers prepare for Longhorn/Avalon?</a> that, at least for me, begs the question &quot;Why should developers prepare for Longhorn/Avalon?&quot; </p>
<p>I&#39;ve been looking at decision trees like the one shown in this slide for more than a decade. It&#39;s always the same yellow-on-blue PowerPoint template, and always the same message: here&#39;s how to manage your investment in current Windows technologies while preparing to assimilate the new stuff. For platform junkies, the internal logic can be compelling. The INETA presentation shows, for example, how it&#39;ll be possible to use XAML to write WinForms apps that host combinations of WinForms and Avalon components, or to write Avalon apps that host either or both style of component. Cool! But...huh? Listen to how Joe Beda frames the &quot;rich vs. reach&quot; debate: </p>
<blockquote class="personQuote JoeBeda">Avalon will be supplanting WinForms, but WinForms is more reach than it is rich. It&#39;s the reach versus rich thing, and in some ways there&#39;s a spectrum. If you write an ASP.NET thing and deploy via the browser, that&#39;s really reach. If you write a WinForms app, you can go down to Win98, I believe. Avalon&#39;s going to be Longhorn only. </blockquote>
<p>So developers are invited to classify degrees of reach -- not only with respect to the Web, but even within Windows -- and to code accordingly. What&#39;s more, they&#39;re invited to consider WinForms, the post-MFC (Microsoft Foundation Classes) GUI framework in the .NET Framework, as &quot;reachier&quot; than Avalon. That&#39;s true by definition since Avalon&#39;s not here yet, but bizarre given that mainstream Windows developers can&#39;t yet regard .NET as a ubiquitous foundation, even though many would like to. </p>
<p>Beda recommends that developers isolate business logic and data-intensive stuff from the visual stuff -- which is always smart, of course -- and goes on to sketch an incremental plan for retrofitting Avalon goodness into existing apps. He concludes: 
</p><blockquote class="personQuote JoeBeda">Avalon, and Longhorn in general, is Microsoft&#39;s stake in the ground, saying that we believe power on your desktop, locally sitting there doing cool stuff, is here to stay. We&#39;re investing on the desktop, we think it&#39;s a good place to be, and we hope we&#39;re going to start a wave of excitement leveraging all these new technologies that we&#39;re building. </blockquote>
<p>It&#39;s not every decade that the Windows presentation subsystem gets a complete overhaul. As a matter of fact, it&#39;s never happened before. Avalon will retire the hodge-podge of DLLs that began with 16-bit Windows, and were carried forward (with accretion) to XP and Server 2003. It will replace this whole edifice with a new one that aims to unify three formerly distinct modes: the document, the user interface, and audio-visual media. This is a great idea, and it&#39;s a big deal. If you&#39;re a developer writing a Windows application that needs to deliver maximum consumer appeal three or four years from now, this is a wave you won&#39;t want to miss. But if you&#39;re an enterprise that will have to buy or build such applications, deploy them, and manage them, you&#39;ll want to know things like: 
</p><ul>
<li>
<p>How much fragmentation can my developers and users tolerate <i>within</i> the Windows platform, never mind across platforms?</p></li>
<li>
<p>Will I be able to remote the Avalon GUI using Terminal Services and Citrix?</p></li>
<li>
<p>Is there any way to invest in Avalon without stealing resources from the Web and mobile stuff that I still have to support?</p></li></ul>
<p>Then again, why even bother to ask these questions? It&#39;s not enough to believe that the return of rich-client technology will deliver compelling business benefits. (Which, by the way, I think it will.) You&#39;d also have to be shown that Microsoft&#39;s brand of rich-client technology will trump all the platform-neutral variations. Perhaps such a case can be made, but the concept demos shown so far don&#39;t do so convincingly. The Amazon demo at the Longhorn PDC (Professional Developers Conference) was indeed cool, but you can see similar stuff happening in <a href="http://www.ultrasaurus.com/sarahblog/archives/000140.html">Laszlo</a>, Flex, and other RIA (rich Internet application) environments today. Not, admittedly, with the same 3D effects. But if enterprises are going to head down a path that entails more Windows lock-in, Microsoft will have to combat the perception that the 3D stuff is gratuitous eye candy, and show order-of-magnitude improvements in users&#39; ability to absorb and interact with information-rich services. </p></blockquote>
<div align="right">[via <a href="http://weblog.infoworld.com/udell/">Jon&#39;s Radio</a>]</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-09#1100">
  <rss:title>Comparison of RDF Query Languages</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-06-09T18:32:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The W3C RDF Data Access Working Group recently released an initial public Working Draft specification for &quot;RDF Data Access Use Cases and Requirements&quot;. Naturally, this triggered discussion on the RDF mailing list along the following lines: In section 4.1, Human-friendly Syntax, you say &quot;There must be a text-based form of the query language which can be read and written by users of the language&quot;, and you list the status as &quot;pending&quot;. As background for section 4.1, you may be interested in RDFQueryLangComparison1 (original text replaced with live link). It shows how to write queries in a form that includes English meanings. The example queries can be run by pointing a browser to www.reengineeringllc.com. Perhaps importantly, given the intricacy of RDF for nonprogrammers, one can get an English explanation of the result of each query. -- Dr. Adrian Walker of Internet Business Logic The Semantic Web continues to take shape, and Infonauts (information-centric agents) are already taking shape. A great thing about the net is the &quot;back to the future&quot; nature of most Web and Internet technology. For instance, we are now frenzied about Service Oriented Architecture (SOA), Event Driven Architecture (EDA), Loose Coupling of Composite Services, etc., basically rehashing the CORBA vision. I see the Semantic Web playing a similar role in relation to artificial intelligence. BTW - It still always comes down to data, and as you can imagine Virtuoso will be playing its usual role of alleviating the practical implementation and utilization challenges of all of the above :-)  </dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>The W3C RDF Data Access Working Group recently released an initial public Working Draft specification for &quot;<a href="http://www.w3.org/TR/2004/WD-rdf-dawg-uc-20040602/"><font color="#000000">RDF Data Access Use Cases and Requirements</font></a>&quot;. Naturally, this triggered discussion on the RDF mailing list along the following lines:</p> <blockquote dir="ltr" style="margin-right: 0px;"> <p>In section 4.1, Human-friendly Syntax, you say &quot;There must be a text-based form of the query language which can be read and written by users of the language&quot;, and you list the status as &quot;pending&quot;.</p> <p>As background for section 4.1, you may be interested in <a href="http://www.aifb.uni-karlsruhe.de/WBS/pha/rdf-query/">RDFQueryLangComparison1</a> (original text replaced with live link).<br />
  <br />It shows how to write queries in a form that includes English meanings.<br />
  <br />The example queries can be run by pointing a browser to <a eudora="autourl" href="http://www.reengineeringllc.com/" title="http://www.reengineeringllc.com/">www.reengineeringllc.com</a> .<br />
  <br />Perhaps importantly, given the intricacy of RDF for nonprogrammers, one can get an English explanation of the result of each query.<br />
  <br />-- Dr. Adrian Walker of <a href="https://www.reengineeringllc.com/ibl_login.html#about">Internet Business Logic </a>
</p>
</blockquote> <p dir="ltr">The Semantic Web continues to take shape, and Infonauts (information-centric agents) are already taking <a href="http://www.reengineeringllc.com/IBL_tutorial_part1.html">shape</a>.</p> <p dir="ltr">A great thing about the net is the &quot;back to the future&quot; nature of most Web and Internet technology. For instance, we are now frenzied about Service Oriented Architecture (SOA), Event Driven Architecture (EDA), Loose Coupling of Composite Services, etc., basically rehashing the CORBA vision.</p> <p dir="ltr">I see the Semantic Web playing a similar role in relation to artificial intelligence. </p> <p dir="ltr">BTW - It still always comes down to data, and as you can imagine <a href="http://www.openlinksw.com/virtuoso">Virtuoso</a> will be playing its usual role of alleviating the practical implementation and utilization challenges of all of the above :-)</p> <p dir="ltr"> </p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-09#557">
  <rss:title>Comparison of RDF Query Languages</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-06-09T17:32:42Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The W3C RDF Data Access Working Group recently released an initial public Working Draft specification for &quot;RDF Data Access Use Cases and Requirements&quot;. Naturally, this triggered discussion on the RDF mailing list along the following lines: In section 4.1, Human-friendly Syntax, you say &quot;There must be a text-based form of the query language which can be read and written by users of the language&quot;, and you list the status as &quot;pending&quot;. As background for section 4.1, you may be interested in RDFQueryLangComparison1 (original text replaced with live link). It shows how to write queries in a form that includes English meanings. The example queries can be run by pointing a browser to www.reengineeringllc.com. Perhaps importantly, given the intricacy of RDF for nonprogrammers, one can get an English explanation of the result of each query. -- Dr. Adrian Walker of Internet Business Logic The Semantic Web continues to take shape, and Infonauts (information-centric agents) are already taking shape. A great thing about the net is the &quot;back to the future&quot; nature of most Web and Internet technology. For instance, we are now frenzied about Service Oriented Architecture (SOA), Event Driven Architecture (EDA), Loose Coupling of Composite Services, etc., basically rehashing the CORBA vision. I see the Semantic Web playing a similar role in relation to artificial intelligence. BTW - It still always comes down to data, and as you can imagine Virtuoso will be playing its usual role of alleviating the practical implementation and utilization challenges of all of the above :-)  </dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The W3C RDF Data Access Working Group recently released an initial public Working Draft specification for "<a href="http://www.w3.org/TR/2004/WD-rdf-dawg-uc-20040602/"><font color="#000000">RDF Data Access Use Cases and Requirements</font></a>". Naturally, this triggered discussion on the RDF mailing list along the following lines:</p>
<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<p>In section 4.1, Human-friendly Syntax, you say "There must be a text-based form of the query language which can be read and written by users of the language", and you list the status as "pending".</p>
<p>As background for section 4.1, you may be interested in <a href="http://www.aifb.uni-karlsruhe.de/WBS/pha/rdf-query/">RDFQueryLangComparison1</a> (original text replaced with live link).<br><br>It shows how to write queries in a form that includes English meanings.<br><br>The example queries can be run by pointing a browser to <a eudora="autourl" href="http://www.reengineeringllc.com/" title="http://www.reengineeringllc.com/">www.reengineeringllc.com</a> .<br><br>Perhaps importantly, given the intricacy of RDF for nonprogrammers, one can get an English explanation of the result of each query.<br><br>-- Dr. Adrian Walker of <a href="https://www.reengineeringllc.com/ibl_login.html#about">Internet Business Logic </a></p></blockquote>
<p dir="ltr">The Semantic Web continues to take shape, and Infonauts (information-centric agents) are already taking <a href="http://www.reengineeringllc.com/IBL_tutorial_part1.html">shape</a>.</p>
<p dir="ltr">A great thing about the net is the "back to the future" nature of most Web and Internet technology. For instance, we are now frenzied about Service Oriented Architecture (SOA), Event Driven Architecture (EDA), Loose Coupling of Composite Services, etc., basically rehashing the CORBA vision.</p>
<p dir="ltr">I see the Semantic Web playing a similar role in relation to artificial intelligence. </p>
<p dir="ltr">BTW - It still always comes down to data, and as you can imagine <a href="http://www.openlinksw.com/virtuoso">Virtuoso</a> will be playing its usual role of alleviating the practical implementation and utilization challenges of all of the above :-)</p>
<p dir="ltr">&nbsp;</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-06-04#555">
  <rss:title>XML, the New Database Heresy</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-06-04T04:04:48Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A great post by Dare, especially his bringing into context the essence of this matter, referred to by C.J. Date as &quot;XML the New Database Heresy&quot;. I have little to add to this matter, as our understanding and vision are aptly expressed via the architecture and feature set of Virtuoso (this area was actually addressed circa 1999). We are heading into an era of multi-model databases: single database engines that are capable of effectively serving the requirements of the Hierarchical, Network, Relational, and Object database models. As we get closer to the unravelling of universal storage, hopefully this will get clearer. Back to Dare&#39;s commentary: C.J. Date, one of the most influential names in the relational database world, had some harsh words about XML&#39;s encroachment into the world of relational databases in a recent article entitled Date defends relational model that appeared on SearchDatabases.com. Key parts of the article are excerpted below: Date reserved his harshest criticism for the competition, namely object-oriented and XML-based DBMSs. Calling them &quot;the latest fashions in the computer world,&quot; Date said he rejects the argument that relational DBMSs are yesterday&#39;s news. Fans of object-oriented database systems &quot;see flaws in the relational model because they don&#39;t fully understand it,&quot; he said. Date also said that XML enthusiasts have gone overboard. &quot;XML was invented to solve the problem of data interchange, but having solved that, they now want to take over the world,&quot; he said. &quot;With XML, it&#39;s like we forget what we are supposed to be doing, and focus instead on how to do it.&quot; Craig S. Mullins, the director of technology planning at BMC Software and a SearchDatabase.com expert, shares Date&#39;s opinion of XML. 
It can be worthwhile, Mullins said, as long as XML is only used as a method of taking data and putting it into a DBMS. But Mullins cautioned that XML data that is stored in relational DBMSs as whole documents will be useless if the data needs to be queried, and he stressed Date&#39;s point that XML is not a real data model. Craig Mullins&#39;s points are more straightforward to answer since his comments don&#39;t jibe with the current state of the art in the XML world. He states that you can&#39;t query XML documents stored in databases, but this is untrue. Almost three years ago, I was writing articles about querying XML documents stored in relational databases. Storing XML in a relational database doesn&#39;t mean it has to be stored as an opaque binary BLOB or as a big bunch of text which cannot effectively be queried. The next version of SQL Server will have extensive capabilities for querying XML data in relational databases and doing joins across relational and XML data; a lot of this functionality is described in the article on XML Support in SQL Server 2005. As for XML not having a data model, I beg to differ. There is a data model for XML that many applications and people adhere to, often without realizing that they are doing so. This data model is the XPath 1.0 data model, which is being updated to handle typed data as the XQuery and XPath 2.0 data model. Now to tackle the meat of C.J. Date&#39;s criticisms, which is that XML solves the problem of data interchange but now is showing up in the database. The first thing I&#39;d like to point out is that there are two broad usage patterns of XML: it is used to represent both rigidly structured tabular data (e.g., relational data or serialized objects) and semi-structured data (e.g., office documents). The latter type of data will only grow now that office productivity software like Microsoft Office has enabled users to save their documents as XML instead of proprietary binary formats. 
In many cases, these documents cannot simply shredded into relational tables. Sure you can shred an Excel spreadsheet written in spreadsheetML into relational tables but is the same really feasible for a Word document written in WordprocessingML? Many enterprises would rather have their important business data being stored and queried from a unified location instead of the current situation where some data is in document management systems, some hangs around as random files in people&#39;s folders while some sits in a database management system. As for stating that critics of the relational model don&#39;t understand it, I disagree. One of the major benefits of using XML in relational databases is that it is a lot easier to deal with fluid schemas or data with sparse entries with XML. When the shape of the data tends to change or is not fixed the relational model is simply not designed to deal with this. Constantly changing your database schema is simply not feasible and there is no easy way to provide the extensibility of XML where one can say &quot;after the X element, any element from any namespace can appear&quot;. How would one describe the capacity to store âany dataâ in a traditional relational database without resorting to an opaque blob? I do tend to agree that some people are going overboard and trying to model their data hierarchically instead of relationally which experience has thought us is a bad idea. Recently on the XML-DEV mailing list entitled Designing XML to Support Information Evolution where Roger L. Costello described his travails trying to model his data which was being transferred as XML in a hierarchical manner. Micheal Champion accurately described the process Roger Costello went through as having &quot;rediscovered the relational model&quot;. In a response to that thread I wrote &quot;Hierarchical databases failed for a reason&quot;. 
Using hierarchy as a primary way to model data is bad for at least the following reasons Hierarchies tend to encourage redundancy. Imagine I have a &lt;Customer&gt; element who has one or more &lt;ShippingAddress&gt; elements as children as well as one or more &lt;Order&gt; elements as children as well. Each order was shipped to an address, so if modelled hierarchically each &lt;Order&gt; element also will have a &lt;ShippingAddress&gt; element which leads to a lot of unnecessary duplication of data. In the real world, there are often multiple groups to which a piece of data belongs which often cannot be modelled with a single hierarchy.   Data is too tightly coupled. If I delete a &lt;Customer&gt; element, this means I&#39;ve automatically deleted his entire order history since all the &lt;Order&gt; elements are children of &lt;Customer&gt;. Similarly if I query for a &lt;Customer&gt;, I end up getting all the &lt;Order&gt; information as well. To put it simply, experience has taught the software world that the relational model is a better way to model data than the hierarchical model. Unfortunately, in the rush to embrace XML many a repreating the mistakes from decades ago in the new millenium. [via Dare Obasanjo aka Carnage4Life]</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p dir="ltr">A great <a href="http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=d28ce1fb-7b27-407d-b1a3-0b9a34831ca1">post</a> by Dare, especially for bringing into context the essence of the matter referred to by C.J. Date as "XML the New Database Heresy".</p>
<p dir="ltr">I have little to add to this matter, as our&nbsp;understanding and vision are aptly expressed via the architecture and feature set of <a href="http://www.openlinksw.com/virtuoso">Virtuoso</a> (this area was actually addressed circa 1999).</p>
<p dir="ltr">We are heading into an era of multi-model databases: single database engines that are capable of effectively serving the requirements of the Hierarchical, Network, Relational, and Object database <a href="http://www.web-dictionary.org/encyclopedia/db/DBMS.html#Navigational_databases">models</a>. As we get closer to the unravelling of universal storage, hopefully this will get clearer.</p>
<p dir="ltr">Back to Dare's commentary:</p>
<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<p><a href="http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Date:C=_J=.html">C.J. Date</a>, one of the most influential names in the relational database world, had some harsh words about XML's encroachment into the world of relational databases in a recent article entitled <a href="http://searchdatabase.techtarget.com/originalContent/0,289142,sid13_gci962948,00.html">Date defends relational model</a>&nbsp;that appeared on SearchDatabases.com. Key parts of the article are excerpted below:</p>
<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<p>Date reserved his harshest criticism for the competition, namely object-oriented and XML-based DBMSs. Calling them "the latest fashions in the computer world," Date said he rejects the argument that relational DBMSs are yesterday's news. Fans of object-oriented database systems "see flaws in the relational model because they don't fully understand it," he said. </p>
<p>Date also said that XML enthusiasts have gone overboard. </p>
<p>"XML was invented to solve the problem of data interchange, but having solved that, they now want to take over the world," he said. "With XML, it's like we forget what we are supposed to be doing, and focus instead on how to do it." </p>
<p>Craig S. Mullins, the director of technology planning at BMC Software and a SearchDatabase.com expert, shares Date's opinion of XML. It can be worthwhile, Mullins said, as long as XML is only used as a method of taking data and putting it into a DBMS. But Mullins cautioned that XML data that is stored in relational DBMSs as whole documents will be useless if the data needs to be queried, and he stressed Date's point that XML is not a real data model. </p></blockquote>
<p dir="ltr">Craig Mullins's points are more straightforward to answer since his comments don't jibe with the current state of the art in the XML world. He states that you can't query XML documents stored in databases, but this is untrue. Almost three years ago, I was writing articles about <a href="http://features.slashdot.org/article.pl?sid=01/10/29/0725214&mode=thread&tid=156">querying XML documents stored in relational databases</a>. Storing XML in a relational database doesn't mean it has to be stored as an opaque&nbsp;binary BLOB or as a big bunch of text which cannot effectively be queried. The next version of SQL Server will have extensive capabilities for querying XML data in a relational database and doing joins across relational and XML data; a lot of this functionality is&nbsp;described in the article on <a href="http://msdn.microsoft.com/xml/default.aspx?pull=/library/en-us/dnsql90/html/sql2k5xml.asp">XML Support in SQL Server 2005</a>. As for XML not having a data model, I beg to differ. There is a data model for XML that many applications and people adhere to, often without realizing that they are doing so. This data model is the <a href="http://www.w3.org/TR/1999/REC-xpath-19991116#data-model">XPath 1.0 data model</a>, which is being updated to handle typed data as the <a href="http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/">XQuery and XPath 2.0 data model</a>. </p>
<p dir="ltr">Now to tackle the meat of C.J. Date's criticisms, which is that XML solves the problem of data interchange but now is showing up in the database. The first point I'd like to make is that there are two broad usage patterns of XML: it is used to represent both rigidly structured tabular data&nbsp;(e.g., relational data or serialized objects) and semi-structured data (e.g., office documents). The latter type of data will only grow now that office productivity software like <a href="http://www.microsoft.com/office">Microsoft Office</a> has enabled users to save their&nbsp;documents as XML instead of proprietary binary formats. In many cases, these documents cannot simply be shredded into relational tables. Sure, you can shred an Excel spreadsheet written in&nbsp;SpreadsheetML into relational tables, but is the same really feasible for a Word document written in WordprocessingML? Many enterprises would rather have their important business data being stored and queried from a unified location instead of the current situation where some data is in document management systems, some hangs around as random files in people's folders while some sits in a database management system. </p>
<p dir="ltr">As for stating that critics of the relational model don't understand it,&nbsp;I disagree. One of the major&nbsp;benefits of using XML&nbsp;in relational databases is that it is a lot easier to deal with fluid schemas or&nbsp;data with sparse entries with XML. When the shape of the data tends to change or is not fixed, the relational model is simply not&nbsp;designed to deal with this. Constantly changing your database schema is simply not feasible and there is no easy way to provide the extensibility of XML where one can say "after the <font face="Courier New">X </font>element, any element from any namespace can appear". How would one describe the capacity to store "any data" in a traditional relational database without resorting to an opaque blob? </p>
<p dir="ltr">I do tend to agree that some people are going overboard and trying to model their data hierarchically instead of relationally, which experience has taught us is a bad idea. Recently there was a thread on the XML-DEV mailing list entitled <a href="http://lists.xml.org/archives/xml-dev/200405/msg00216.html">Designing XML to Support Information Evolution</a>, where Roger L. Costello described his travails trying to model his data, which was being transferred as XML, in a hierarchical manner. Michael Champion accurately described the process Roger Costello went through as having "rediscovered the relational model". In&nbsp;a response to that thread I wrote "Hierarchical databases failed for a reason". </p>
<p dir="ltr">Using hierarchy as a primary way to model data is bad for at least the following reasons:</p>
<ol dir="ltr">
<li>
<div>Hierarchies tend to encourage redundancy. Imagine I have a &lt;Customer&gt; element that has one or more &lt;ShippingAddress&gt; elements as children, as well as one or more &lt;Order&gt; elements. Each order was shipped to an address, so if modelled hierarchically each &lt;Order&gt; element will also have a &lt;ShippingAddress&gt; element, which leads to a lot of unnecessary duplication of data. </div></li>
<li>
<div>In the real world, there are often multiple&nbsp;groups to which a piece of data belongs which often cannot be modelled with a single hierarchy. &nbsp; </div></li>
<li>
<div>Data is too tightly coupled. If I delete a &lt;Customer&gt; element, this means I've automatically deleted his entire order history since all&nbsp;the &lt;Order&gt; elements are children of &lt;Customer&gt;. Similarly if I query for a &lt;Customer&gt;, I end up getting all the &lt;Order&gt; information as well. </div></li></ol>
<p>To put it simply, experience has taught the software world that the relational model is a better way to model data than the hierarchical model. Unfortunately, in the rush to embrace XML many are repeating the mistakes from decades ago in the new millennium. </p></blockquote>
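<p dir="ltr">The redundancy and coupling points in the list above can be sketched in a few lines of Python. This is only an illustration and not part of Dare's post: it assumes the standard <code>xml.etree.ElementTree</code> and <code>sqlite3</code> modules, and the ids, the address value, and the two-table layout are hypothetical.</p>

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hierarchical model: every Order nests its own copy of the shipping address.
root = ET.Element("Customers")
customer = ET.SubElement(root, "Customer", id="c1")
ET.SubElement(customer, "ShippingAddress").text = "1 Main St"
for order_id in ("o1", "o2"):
    order = ET.SubElement(customer, "Order", id=order_id)
    ET.SubElement(order, "ShippingAddress").text = "1 Main St"

# Redundancy: one copy on the customer plus one copy per order.
address_copies = len(list(root.iter("ShippingAddress")))

# Coupling: removing the customer removes every nested order with it.
root.remove(customer)
orders_left_in_tree = len(list(root.iter("Order")))

# Relational model: the address is stored once and orders refer to it by key.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, customer_id TEXT, street TEXT);
    CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT,
                         address_id INTEGER REFERENCES address(id));
    INSERT INTO address VALUES (1, 'c1', '1 Main St');
    INSERT INTO orders VALUES ('o1', 'c1', 1), ('o2', 'c1', 1);
""")
address_rows = db.execute("SELECT COUNT(*) FROM address").fetchone()[0]
order_rows = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(address_copies, orders_left_in_tree, address_rows, order_rows)
```

<p dir="ltr">In the tree, the address text is repeated for every order and the orders disappear along with their parent; in the tables, both orders share a single address row that can be queried or updated on its own.</p>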
<div align="right">[via <a href="http://www.25hoursaday.com/weblog/">Dare Obasanjo aka Carnage4Life</a>]</div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-04-23#526">
  <rss:title>INSTEAD-OF Triggers</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-04-24T00:52:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">During a session with a potential customer/partner I was posed the following question re. Virtuoso&#39;s Virtual Database functionality: &quot;Can I create an updateable SQL VIEW in Virtuoso that would comprise columns from 3rd party databases such as Oracle, SQL Server, and say MySQL&quot;. The answer was yes, based on the fact that Virtuoso does support SQL INSTEAD-OF Triggers - even in Virtual Database mode. I am certainly keen to see if any other Virtual Database style products achieve this feat (which is trying for many homogeneous SQL database engines). Dr. Paul Dorsey of Dulcian, Inc. wrote a very good article about this subject, and here is an excerpt from his article overview: Views are an important part of application development. Since Oracle 7.3, we quickly recognized the importance of using Oracle&#39;s updateable view feature. An updateable view allows you to join several tables and perform updates against the driving table. For example, if you join EMP and DEPT in the traditional way and display columns from both tables, DML operations are possible against EMP but not DEPT. For traditional relational database designs, this is enough functionality. For example, in a typical Forms application, when you are basing a block on a table, the additional columns that you want to display are lookups from other tables and can therefore be easily supported using traditional updateable views. These views are built using a combination of joins and outer joins or, in extreme cases, looking up additional information through functions embedded in the views. Under no circumstances should post query triggers be used to support this functionality. Post query triggers cause unnecessary network traffic and also embed the logic in the application rather than in the database or somewhere else where it can easily be reused. 
What happens in a situation where the information you want to display in the block requires a query that is so complex that your ability to maintain (insert, update, delete) that information using a simple updateable view is eliminated? The updateable views are relatively restrictive. Only a single table can be updated. Joins must be created carefully and based on Foreign Key constraints in the database. No set operators such as UNION or MINUS can be used. For these reasons, it is common to end up with a block that cannot be updated as required. How do most developers handle this situation? a) By placing complex logic in the form (WHEN-VALIDATE-ITEM triggers) b) By writing procedures that access Forms&#39; ability to replace the Insert, Update, Delete routines and place that logic in the form. These practices are just as undesirable as using POST-QUERY triggers. The logic is in the wrong place and is not reusable. The INSTEAD-OF trigger views feature was introduced by Oracle in version 8.15. This feature enables developers to create views on single or multiple tables or any other view imaginable by writing INSTEAD-OF triggers that tell the view how to behave when Inserts, Updates or Deletes are issued. Peter Koletzke and I first wrote about this feature in our Oracle Press book Oracle Developer: Advanced Forms &amp; Reports (2000). At the time, we gave the feature relatively brief mention because we believed that most of the systems we were building included blocks based on traditional updateable views, which allow updates to a single table. Now, there is a good reason to look more closely at INSTEAD-OF trigger views. Database Journal also has an article on this subject.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>During a session with a potential customer/partner I was posed the following question re. <a href="http://www.openlinksw.com/virtuoso/whatis.htm">Virtuoso's Virtual Database</a> functionality:</p>
<p>"Can I create an updateable SQL VIEW in Virtuoso that would comprise columns from 3rd party databases such as Oracle, SQL Server, and say MySQL".</p>
<p>The answer was yes,&nbsp;based on&nbsp;the fact that Virtuoso does support SQL <a href="http://docs.openlinksw.com/virtuoso/TRIGGERS.html#TRIGGERS">INSTEAD-OF Triggers</a> - even in Virtual Database mode. </p>
<p>I am certainly keen to see if any other Virtual Database style products achieve this feat (which is trying for many homogeneous SQL database engines).</p>
<p>Dr. Paul Dorsey of <a href="http://www.dulcian.com/">Dulcian, Inc</a>. wrote a very <a href="http://www.dulcian.com/papers/INSTEAD%20OF%20Trigger%20Views_ODTUG.htm">good article about this subject</a>, and here is an excerpt from his article overview:</p>
<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<p>Views are an important part of application development. Since Oracle 7.3, we quickly recognized the importance of using Oracle's updateable view feature. An updateable view allows you to join several tables and perform updates against the driving table. For example, if you join EMP and DEPT in the traditional way and display columns from both tables, DML operations are possible against EMP but not DEPT. </p>
<p class="Text">For traditional relational database designs, this is enough functionality. For example, in a typical Forms application, when you are basing a block on a table, the additional columns that you want to display are lookups from other tables and can therefore be easily supported using traditional updateable views. These views are built using a combination of joins and outer joins or,<span>&nbsp; </span>in extreme cases, looking up additional information through functions embedded in the views. Under no circumstances should post query triggers be used to support this functionality. Post query triggers cause unnecessary network traffic and also embed the logic in the application rather than in the database or somewhere else where it can easily be reused.</p>
<p class="Text">What happens in a situation where the information you want to display in the block requires a query that is so complex that your ability to maintain (insert, update, delete) that information using a simple updateable view is eliminated? The updateable views are relatively restrictive. Only a single table can be updated. Joins must be created carefully and based on Foreign Key constraints in the database. No set operators such as UNION or MINUS can be used. For these reasons, it is common to end up with a block that cannot be updated as required. How do most developers handle this situation?</p>
<p class="Text" style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in">a)<span style="FONT: 7pt 'Times New Roman'; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>By placing complex logic in the form (WHEN-VALIDATE-ITEM triggers)</p>
<p class="Text" style="MARGIN-LEFT: 0.5in; TEXT-INDENT: -0.25in">b)<span style="FONT: 7pt 'Times New Roman'; font-size-adjust: none; font-stretch: normal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>By writing procedures that access Forms' ability to replace the Insert, Update, Delete routines and place that logic in the form</p>
<p class="Text">These practices are just as undesirable as using POST-QUERY triggers. The logic is in the wrong place and is not reusable. </p>
<p class="Text">The INSTEAD-OF trigger views feature was introduced by Oracle in version 8.15. This feature enables developers to create views on single or multiple tables or any other view imaginable by writing INSTEAD-OF triggers that tell the view how to behave when Inserts, Updates or Deletes are issued. Peter Koletzke and I first wrote about this feature in our Oracle Press book <i>Oracle Developer: Advanced Forms &amp; Reports</i> (2000). At the time, we gave the feature relatively brief mention because we believed that most of the systems we were building included blocks based on traditional updateable views, which allow updates to a single table. Now, there is a good reason to look more closely at INSTEAD-OF trigger views. </p></blockquote>
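<p>To make the idea in the excerpt concrete, here is a minimal sketch of an INSTEAD-OF trigger that makes a two-table join view updateable. It is an assumption-laden illustration, not Virtuoso or Oracle code: it runs against SQLite through Python's standard <code>sqlite3</code> module, the EMP and DEPT tables echo the excerpt, and the view name <code>emp_dept</code>, the trigger name, and the sample rows are hypothetical. Trigger syntax differs between engines, so check the documentation of the database you actually use.</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp (empno INTEGER PRIMARY KEY, ename TEXT,
                      deptno INTEGER REFERENCES dept(deptno));
    INSERT INTO dept VALUES (10, 'ACCOUNTING');
    INSERT INTO emp VALUES (7782, 'CLARK', 10);

    -- A join view: not directly updateable in SQLite.
    CREATE VIEW emp_dept AS
        SELECT e.empno, e.ename, d.deptno, d.dname
        FROM emp e JOIN dept d ON d.deptno = e.deptno;

    -- The trigger body runs instead of the UPDATE on the view and
    -- routes each column to the base table that owns it.
    CREATE TRIGGER emp_dept_update INSTEAD OF UPDATE ON emp_dept
    BEGIN
        UPDATE emp SET ename = NEW.ename WHERE empno = OLD.empno;
        UPDATE dept SET dname = NEW.dname WHERE deptno = OLD.deptno;
    END;
""")

# One statement against the view changes rows in both base tables.
db.execute("UPDATE emp_dept SET ename = 'KING', dname = 'FINANCE' WHERE empno = 7782")
emp_row = db.execute("SELECT ename FROM emp WHERE empno = 7782").fetchone()
dept_row = db.execute("SELECT dname FROM dept WHERE deptno = 10").fetchone()
print(emp_row, dept_row)
```

<p>The update logic lives with the view in the database rather than in each application form, which is the reuse argument the excerpt makes.</p>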
<p class="Text" dir="ltr">Database Journal also has an <a href="http://www.databasejournal.com/features/mssql/article.php/1437741">article on this subject</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-03-24#493">
  <rss:title>Metadata? Thesauri? Taxonomies? Topic Maps! Making Sense Of It All.</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-03-24T15:43:12Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">By Lars Marius Garshol, Ontopia Technical Report Information Architecture is the discipline dealing with the modern version of this problem: how to organize web sites so that users actually can find what they are looking for. Information architects have so far applied known and well-tried tools from library science to solve this problem, and now topic maps are sailing up as another potential tool for information architects. This raises the question of how topic maps compare with the traditional solutions. The paper argues that topic maps go beyond the traditional solutions in the sense that it provides a framework within which they can be represented as they are, but also extended in ways which significantly improve information retrieval. The paper tries to show that topic maps provide a common reference model that can be used to explain how to understand many common techniques from library science and information architecture. http://www.ontopia.net/topicmaps/materials/tm-vs-thesauri.html See also (XML) Topic Maps: http://xml.coverpages.org/topicMaps.html</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<font size="2">
<p>By Lars Marius Garshol, Ontopia Technical Report</p>
<p>Information Architecture is the discipline dealing with the modern version of this problem: how to organize web sites so that users actually can find what they are looking for. Information architects have so far applied known and well-tried tools from library science to solve this problem, and now topic maps are sailing up as another potential tool for information architects. This raises the question of how topic maps compare with the traditional solutions. The paper argues that topic maps go beyond the traditional solutions in the sense that it provides a framework within which they can be represented as they are, but also extended in ways which significantly improve information retrieval. The paper tries to show that topic maps provide a common reference model that can be used to explain how to understand many common techniques from library science and information architecture.</p>
<p><a href="http://www.ontopia.net/topicmaps/materials/tm-vs-thesauri.html"><u><font color="#0000ff">http://www.ontopia.net/topicmaps/materials/tm-vs-thesauri.html</font></u></a></p>
<p>See also (XML) Topic Maps: <a href="http://xml.coverpages.org/topicMaps.html"><u><font color="#0000ff">http://xml.coverpages.org/topicMaps.html</font></u></a></p></font>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2004-02-03#462">
  <rss:title>Remember WebDAV</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2004-02-03T21:04:10Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">WebDAV is one of those interesting standards that sometimes gets lost in the broader industry hoopla. Well, I finally decided to take a look at Mozilla&#39;s Calendar project as a more open solution for sharing my calendar. After browsing around a little I came across the following piece: To share your calendars, you need access to a webDAV server. If you run your own web server, you can install mod_dav, a free Apache module that will turn your web server into a webDAV server. Instructions on how to set it up are on their website. Once you set up your webDAV server, you can publish your calendar to the site, then subscribe to it from any other Mozilla Calendar. Automatically updating the calendar will give you a poor man&#39;s calendar server. Through WebDAV we will be able to share calendars across disparate calendaring tools (albeit with some degree of pain when Outlook is in the mix). Even better for me, I can post my shared calendar data via a Virtuoso instance (internally and externally since WebDAV is one of the many protocols that it implements), in short I could even seriously consider generating this on the fly and sharing it via this blog (Wow!). We aren&#39;t too many miles away from open and standards compliant Unified Data Storage thanks to WebDAV.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<P><A href="http://www.webdav.org">WebDAV</A> is one of those interesting standards that sometimes gets lost in the broader industry hoopla. Well, I finally decided to take a look at <A href="http://www.mozilla.org/projects/calendar/">Mozilla's Calendar project</A> as a more open solution for sharing my calendar. After browsing around a little I came across the following <A href="http://www.mozilla.org/projects/calendar/faq.html#share">piece</A>:</P>
<BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px">
<P><EM>To share your calendars, you need access to a </EM><A href="http://www.webdav.org/"><EM>webDAV server</EM></A><EM>. If you run your own web server, you can install </EM><A href="http://www.webdav.org/mod_dav"><EM>mod_dav</EM></A><EM>, a free Apache module that will turn your web server into a webDAV server. Instructions on how to set it up are on their website. Once you set up your webDAV server, you can publish your calendar to the site, then subscribe to it from any other Mozilla Calendar. Automatically updating the calendar will give you a poor man's calendar server.</EM></P></BLOCKQUOTE>
<P>Through WebDAV we will be able to share calendars across disparate calendaring tools (albeit with some degree of pain when Outlook is in the mix). Even better for me, I can post my shared calendar data via a <A href="http://www.openlinksw.com/virtuoso">Virtuoso</A> instance (internally and externally since <A href="http://www.openlinksw.com/virtuoso/whatis.htm#webdav">WebDAV is one of the many protocols that it implements</A>), in short I could even seriously consider generating this on the fly and sharing it via this blog (Wow!).</P>
<P>We&nbsp;aren't too many miles away from open and standards compliant Unified Data Storage thanks to WebDAV.</P>
<P>&nbsp;</P>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-11-11#423">
  <rss:title>Creating RSS Using SQLX</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-11-11T23:33:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Here is a practical example of how to create RSS on the fly from SQL data sources leveraging Virtuoso 3.2&#39;s SQLX implementation. This further illuminates the content of my earlier post on this subject.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<P>Here is a <A href="http://www.openlinksw.com/articles/rssvirtsqlx.htm">practical example of how to create RSS on the fly from SQL</A> data sources leveraging Virtuoso 3.2's SQLX implementation.</P>
<P>This further illuminates the content of my <A href="http://www.openlinksw.com/weblogs/virtuoso/index.vspx?id=426">earlier post</A> on this subject.</P>
<P>&nbsp;</P>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-10-31#410">
  <rss:title>Replace and defend -- Contd</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-10-31T20:58:52Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Reading the Longhorn SDK docs is a disorienting experience. Everything&#39;s familiar but different. Consider these three examples: [Full story: Replace and defend via Jon&#39;s Radio] &quot;Replace &amp; Defend&quot; is certainly a strategy that would have awakened the entire non-Microsoft developer world during the recent PDC event. I know these events are all about preaching to the choir (Windows-only developers), but as someone who has worked with Microsoft technologies as an ISV since the late 80&#39;s, there is something about this event&#39;s announcements that leaves me concerned. Ironically, these concerns aren&#39;t about the competitive aspects of their technology disruptions, but more along the lines of how Microsoft (I hope inadvertently) generates the kinds of sentiments echoed in the comments thread from Scoble&#39;s recent &quot;How to hate Microsoft&quot; post. As indicated in my response to this post, I don&#39;t believe Microsoft is as bad or evil as is instinctively assumed in many quarters, but I can certainly understand why they are hated by others, which is really unfortunate, especially bearing in mind that they have done more good than harm to date (in my humble opinion). Anyway, back to my concerns post PDC, which I break down as follows: Disruptive assaults on existing standards with the only benefit being Microsoft platform centricity. Jon Udell addressed this in his &quot;Replace and Defend&quot; post (which kicked off this post), and I see exactly what he sees here, and I don&#39;t see any reason for this approach whatsoever. Even if one of these standards was deficient, what stops Microsoft from addressing these deficiencies and then, should the W3C&#39;s standards acceptance and ratification process bog things down, at least letting the industry know you gave openness a chance but have to move on? 
Gradual obsolescence of existing Microsoft standards which used to provide interfaces for 3rd party ISV partners, and replacing these with totally closed infrastructure implementations that bind to Microsoft products only. A good example is WinFS. I believe in the unified data storage concept; it&#39;s a vision that I&#39;ve believed in for many years, but there is no indication from any PDC presentation or blog that I have read so far (I aggregate a serious number of feeds) that Microsoft is committed to an architectural strategy that enables 3rd party ISVs to hook their data stores and data sources into this storage infrastructure - it&#39;s simply about Yukon (SQL Server) and that&#39;s basically it. WinFS needs to architecturally separate the System Provider from the Data Provider (pretty much the OLE-DB architecture) with Microsoft naturally providing reference System Provider (pretty much what was demonstrated at PDC) and Data Provider (ADO.NET, OLE DB, and ODBC) implementations. Third parties can choose to produce custom WinFS Service or Data Providers which serve their data access needs. It&#39;s impractical to want to force every non SQL Server customer over to SQL Server in order for them to exploit WinFS, and I certainly hope this isn&#39;t the definitive strategy at Microsoft.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px">
<P dir=ltr style="MARGIN-RIGHT: 0px">Reading the Longhorn SDK docs is a disorienting experience. Everything's familiar but different. Consider these three examples: </P>
<P dir=ltr style="MARGIN-RIGHT: 0px">[Full story: <A href="http://weblog.infoworld.com/udell/2003/10/31.html#a836">Replace and defend</A> via <A href="http://weblog.infoworld.com/udell/">Jon's Radio</A>]</P></BLOCKQUOTE>
<P dir=ltr style="MARGIN-RIGHT: 0px">"Replace &amp; Defend" is certainly a strategy that would have awakened the entire non-Microsoft developer world during the recent PDC event. I know these events are all about preaching to the choir (Windows-only developers), but as someone who has worked with Microsoft technologies as an ISV since the late 80's, there is something about this event's announcements that leaves me concerned. </P>
<P dir=ltr style="MARGIN-RIGHT: 0px">Ironically, these concerns aren't about the competitive aspects of their technology disruptions, but more along the lines of how&nbsp;Microsoft (I hope inadvertently) generates the kinds of sentiments echoed in the <A href="http://longhornblogs.com/scobleizer/posts/345.aspx#FeedBack">comments thread</A> from <A href="http://longhornblogs.com/">Scoble's</A> recent <A href="http://longhornblogs.com/scobleizer/posts/345.aspx">"How to hate Microsoft"</A> post. As indicated in my response to this post,&nbsp;I don't believe&nbsp;Microsoft is as bad or evil as is instinctively assumed in many quarters, but I can certainly understand why they&nbsp;are hated by others, which is really unfortunate, especially&nbsp;bearing in mind that they have done more good than harm&nbsp;to date&nbsp;(in my humble&nbsp;opinion). </P>
<P dir=ltr style="MARGIN-RIGHT: 0px">Anyway, back to my concerns post PDC which I break down as follows:</P>
<OL dir=ltr>
<LI>
<DIV style="MARGIN-RIGHT: 0px">Disruptive assaults on existing standards with the only benefit being Microsoft platform centricity. <A href="http://weblog.infoworld.com/udell/2003/10/31.html#a836">Jon Udell addressed this in his "Replace and Defend" post </A>(which kicked off this post), and I see exactly what he sees here, and I don't see any reason for this approach whatsoever. Even if one of these standards was deficient, what stops&nbsp;Microsoft from addressing these deficiencies and then, should the W3C's standards acceptance and ratification process bog things down, at least letting the industry know that it gave openness a chance&nbsp;but had to move on? <BR><BR></DIV></LI>
<LI>
<DIV style="MARGIN-RIGHT: 0px">Gradual obsolescence of existing Microsoft standards which used to provide interfaces for 3rd party ISV partners, and replacing these with totally closed infrastructure implementations that bind to Microsoft products only.&nbsp; A good example is <A href="http://msdn.microsoft.com/longhorn/default.aspx?pull=/msdnmag/issues/04/01/WinFS/default.aspx">WinFS</A>. I believe in the unified data storage concept, <A href="http://www.openlinksw.com/blog/~kidehen/index.vspx?id=406">it's a vision that I've believed in for&nbsp;many years</A>, but there is no indication&nbsp;in any PDC presentation or blog that I have&nbsp;read so far (I aggregate&nbsp;a serious number of feeds)&nbsp;that Microsoft is committed to an architectural strategy that enables 3rd party ISVs to hook their data stores and data sources into this storage infrastructure -&nbsp;it's simply about <A href="http://www.openlinksw.com/blog/~kidehen/index.vspx?id=407">Yukon (SQL Server)</A> and that's basically it.</DIV></LI></OL>
<P style="MARGIN-RIGHT: 0px">WinFS needs to architecturally separate the <STRONG>System Provider</STRONG> from the <STRONG>Data Provider</STRONG> (pretty much the OLE-DB architecture)&nbsp;with Microsoft&nbsp;naturally providing reference System Provider (pretty much what was demonstrated at PDC)&nbsp;and Data Provider (ADO.NET, OLE DB, and ODBC) implementations. Third parties can choose to produce custom WinFS Service or Data Providers which serve their data access needs. It's impractical to want to force every non-SQL Server customer over to SQL Server in order for them to exploit WinFS, and I certainly hope this isn't the definitive strategy at Microsoft.</P>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-10-24#399">
  <rss:title>HOWTO: Apache-PHP-ODBC on Mac OS X</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-10-24T20:55:06Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">There is a new HOWTO document that addresses an area of frequent confusion on Mac OS X, namely how to build PHP with an ODBC data access layer binding (iODBC variant) using Mac OS X Frameworks as opposed to Darwin Shared Libraries. This document basically brings clarity to both the Frameworks and Darwin Shared Library approaches.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<DIV class=Section1>
<P><FONT face="Times New Roman"><SPAN style="FONT-SIZE: 12pt"><FONT size=2>There is a new </FONT><A href="http://www.iodbc.org/iodbc-phposxHOWTO.html"><FONT size=2>HOWTO document</FONT></A><FONT size=2> that addresses an area of frequent confusion on Mac OS X, namely how to build PHP with an ODBC data access layer binding (</FONT><A href="http://www.iodbc.org/"><FONT size=2>iODBC</FONT></A><FONT size=2> variant) using Mac OS X Frameworks as opposed to Darwin Shared Libraries. </FONT></SPAN></FONT></P>
<P><FONT face="Times New Roman"><SPAN style="FONT-SIZE: 12pt"></SPAN></FONT><FONT face="Times New Roman" size=2><SPAN style="FONT-SIZE: 12pt"><FONT size=2>This document basically brings clarity to both the Frameworks and Darwin Shared library approaches</FONT>.</SPAN></FONT></P></DIV>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-09-04#247">
  <rss:title>Multimedia Blogging</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-09-04T17:49:37Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">While reading Computing magazine I stumbled across an article titled: Picture Messaging Comes To The Rescue. The interesting thing about this article is that it inadvertently brings attention to the breadth of blogging. Here are some article excerpts: Lives could be saved if pioneering messaging trial is a success Mobile phone photo messaging could help to save lives at the scene of an accident if a new service being tested in Scotland is successful. Fife Fire &amp; Rescue Service has started trials using photo messaging to receive advice from doctors on how to deal with critical injuries at major incidents. Rescue officers will send photo messages of accidents via GPRS to the Accident and Emergency (A&amp;E) unit at Dunfermline&#39;s Queen Margaret hospital, preparing emergency wards for the arrival of casualties and receiving help in return. &#39;We plan to send pictures of traffic accidents directly to the hospital, in order to get advice about how best to deal with the accident victims,&#39; said Fife Fire &amp; Rescue Service firemaster Mike Bitcon. Using the photographs, doctors can assess the injuries and prepare appropriately, as well as deciding if a doctor should be present at the scene of the accident. &#39;We&#39;re confident that this initiative will help to save lives,&#39; said Bitcon. This can all happen right now, and independent of any particular network carrier or device via the inherent power of blogging (using mobile or multimedia blogging technology).</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[While reading <a href="http://www.vnunet.com">Computing magazine</a> I stumbled across an article titled: <a href="http://www.vnunet.com/lite/News/1142978">Picture Messaging Comes To The Rescue</a>.

The interesting thing about this article is that it inadvertently brings attention to the breadth of blogging. Here are some article excerpts:
<blockquote>Lives could be saved if pioneering messaging trial is a success
Mobile phone photo messaging could help to save lives at the scene of an accident if a new service being tested in Scotland is successful.

Fife Fire & Rescue Service has started trials using photo messaging to receive advice from doctors on how to deal with critical injuries at major incidents.

Rescue officers will send photo messages of accidents via GPRS to the Accident and Emergency (A&E) unit at Dunfermline's Queen Margaret hospital, preparing emergency wards for the arrival of casualties and receiving help in return.

'We plan to send pictures of traffic accidents directly to the hospital, in order to get advice about how best to deal with the accident victims,' said Fife Fire & Rescue Service firemaster Mike Bitcon.

Using the photographs, doctors can assess the injuries and prepare appropriately, as well as deciding if a doctor should be present at the scene of the accident.

'We're confident that this initiative will help to save lives,' said Bitcon.</blockquote>

This can all happen right now, and independent of any particular network carrier or device via the inherent power of blogging (using mobile or multimedia blogging technology). ]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-08-21#241">
  <rss:title>RSS: INJAN (It&#39;s not just about news)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-08-21T15:41:25Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">When Virtuoso first unleashed support for XML (in-built XSL, Native XML Storage, Validating XML Parser, XPath, and XQuery) the core message was the delivery of a single server solution that would address the challenges of creating XML data. In the year 2000 the question of the shape and form of XML data was unclear to many, and reading the article below basically took me back in time to when we released Virtuoso 2.0 (we are now at release 3.0 commercially with a 3.2 beta dropping any minute). RSS is a great XML application, and it does a great job of demonstrating how XML --the new data access foundation layer-- will galvanize the next generation Web (I refer to this as Web 2.0). RSS: INJAN (It&#39;s not just about news) RSS is not just about news, according to Ian Davis on rss-dev. He presents a nice list of alternatives, which I reproduce here (and to which I&#39;d add, of course, bibliography management) Sitemaps: one of the S&#39;s in RSS stands for summary. A sitemap is a summary of the content on a site, the items are pages or content areas. This is clearly a non-chronological ordering of items. Is a hierarchy of RSS sitemaps implied here - how would the linking between them work? How hard would it be to hack a web browser to pick up the RSS sitemap and display it in a sidebar when you visit the site? Small ads: also known as classifieds. These expire so there&#39;s some kind of dynamic going on here but the ordering of items isn&#39;t necessarily chronological. How to describe the location of the seller, or the condition of the item or even the price. Not every ad is selling something - perhaps it&#39;s to rent out a room. Personals: similar model to the small ads. No prices though (I hope). Comes with a ready made vocabulary of terms that could be converted to an RDF schema. 
Probably should do that just for the hell of it anyway - gsoh Weather reports: how about a week&#39;s worth of weather in an RSS channel. If an item is dated in the future, should an aggregator display it before time? Alternate representations include maps of temperature and pressure etc. Auctions: again, related to small ads, but these are much more time limited since there is a hard cutoff after which the auction is closed. The sequence of bids could be interesting - would it make sense to thread them like a discussion so you can see the tactics? TV listings: this is definitely chronological but with a twist - the items have durations. They also have other metadata such as cast lists, classification ratings, widescreen, stereo, program type. Some types have additional information such as director and production year. Top ten listings: top ten singles, books, dvds, richest people, ugliest, rear of the year etc. Not chronological, but has definite order. May update from day to day or even more often. Sales reporting: imagine if every department of a company reported their sales figures via RSS. Then the divisions aggregate the departmental figures and republish to the regional offices, who aggregate and add value up the chain. The chairman of the company subscribes to one super-aggregate feed. Membership lists / buddy lists: could I publish my buddy list from Jabber or other instant messengers? Maybe as an interchange format or perhaps could be used to look for shared contacts. Lots of potential overlap with FOAF here. Mailing lists: or in fact any messaging system such as usenet. There are some efforts at doing this already (e.g. yahoogroups) but we need more information - threads; references; headers; links into archives. Price lists / inventory: the items here are products or services. No particular ordering but it&#39;d be nice to be able to subscribe to a catalog of products and prices from a company. 
The aggregator should be able to pick out price rises or bargains given enough history. [via Semantic Blogging Demonstrator] Thus, if we can comprehend RSS (the blog article above does a great job) we should be able to see the fundamental challenges that are before any organization seeking to exploit the potential of the imminent Web 2.0 inflection: how will you cost-effectively create XML data from existing data sources, without upgrading or switching database engines, operating systems, or programming languages? Put differently, how can you exploit this phenomenon without losing your ever dwindling technology choices (believe me, choices are dwindling fast but most are oblivious to this fact)?</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p><span style="font-size: 10pt; font-family: Arial;">When Virtuoso first unleashed support for XML (in-built XSL, Native XML Storage, Validating XML Parser, XPath, and XQuery) the core message was the delivery of a single server solution that would address the challenges of creating XML data.</span></p><p xmlns="o"></p> <p><span style="font-size: 10pt; font-family: Arial;">In the year 2000 the question of the shape and form of XML data was unclear to many, and reading the article below basically took me back in time to when we released <a href="http://www.it-director.com/article.php?articleid=916">Virtuoso 2.0</a> (we are now at <a href="http://www.openlinksw.com/virtuoso">release 3.0</a> commercially with a <a href="http://www.openlinksw.com/press/virt32_wwdc1.htm">3.2 beta </a>dropping any minute).</span></p><p xmlns="o"></p> <p><span style="font-size: 10pt; font-family: Arial;">RSS is a great XML application, and it does a great job of demonstrating how XML --the new data access foundation layer-- will galvanize the next generation Web (I refer to this as Web 2.0). </span></p> <blockquote dir="ltr" style="margin-right: 0px;"><span style="font-size: 10pt; font-family: Arial;"> <p><a href="http://jena.hpl.hp.com:3030/blojsom-hp/blog/technologies/blogging/metadata/?permalink=1214847A10C1966396472E816A7A4243.textile">RSS: INJAN (It&#39;s not just about news)</a> </p> <p><span class="caps">RSS</span> is not just about news, according to <a href="http://groups.yahoo.com/group/rss-dev/message/5764">Ian Davis on rss-dev</a>.<br />He presents a nice list of alternatives, which I reproduce here (and to which I'd add, of course, bibliography management)</p> <ul> <li>Sitemaps: one of the S's in <span class="caps">RSS</span> stands for summary. A sitemap is a summary of the content on a site, the items are pages or content areas. This is clearly a non-chronological ordering of items. 
Is a hierarchy of <span class="caps">RSS</span> sitemaps implied here - how would the linking between them work? How hard would it be to hack a web browser to pick up the <span class="caps">RSS</span> sitemap and display it in a sidebar when you visit the site?</li> <li>Small ads: also known as classifieds. These expire so there's some kind of dynamic going on here but the ordering of items isn't necessarily chronological. How to describe the location of the seller, or the condition of the item or even the price. Not every ad is selling something - perhaps it's to rent out a room.</li> <li>Personals: similar model to the small ads. No prices though (I hope). Comes with a ready made vocabulary of terms that could be converted to an <span class="caps">RDF</span> schema. Probably should do that just for the hell of it anyway - gsoh</li> <li>Weather reports: how about a week's worth of weather in an <span class="caps">RSS</span> channel. If an item is dated in the future, should an aggregator display it before time? Alternate representations include maps of temperature and pressure etc.</li> <li>Auctions: again, related to small ads, but these are much more time limited since there is a hard cutoff after which the auction is closed. The sequence of bids could be interesting - would it make sense to thread them like a discussion so you can see the tactics?</li> <li>TV listings: this is definitely chronological but with a twist - the items have durations. They also have other metadata such as cast lists, classification ratings, widescreen, stereo, program type. Some types have additional information such as director and production year.</li> <li>Top ten listings: top ten singles, books, dvds, richest people, ugliest, rear of the year etc. Not chronological, but has definite order. 
May update from day to day or even more often.</li> <li>Sales reporting: imagine if every department of a company reported their sales figures via <span class="caps">RSS</span>. Then the divisions aggregate the departmental figures and republish to the regional offices, who aggregate and add value up the chain. The chairman of the company subscribes to one super-aggregate feed.</li> <li>Membership lists / buddy lists: could I publish my buddy list from Jabber or other instant messengers? Maybe as an interchange format or perhaps could be used to look for shared contacts. Lots of potential overlap with <span class="caps">FOAF</span> here.</li> <li>Mailing lists: or in fact any messaging system such as usenet. There are some efforts at doing this already (e.g. yahoogroups) but we need more information - threads; references; headers; links into archives.</li> <li>Price lists / inventory: the items here are products or services. No particular ordering but it'd be nice to be able to subscribe to a catalog of products and prices from a company. The aggregator should be able to pick out price rises or bargains given enough history.</li> <div align="right">[via <a href="http://jena.hpl.hp.com:3030/blojsom-hp/blog/">Semantic Blogging Demonstrator</a>] </div></ul></span></blockquote> <p><span style="font-size: 10pt; font-family: Arial;">Thus, if we can comprehend RSS (the blog article above does a great job) we should be able to see the fundamental challenges that are before any organization seeking to exploit the potential of the imminent Web 2.0 inflection: how will you cost-effectively create XML data from existing data sources, without upgrading or switching database engines, operating systems, or programming languages? Put differently, how can you exploit this phenomenon without losing your ever dwindling technology choices (believe me, choices are dwindling fast but most are oblivious to this fact)?</span></p><p xmlns="o"></p> <p>&nbsp;</p>
<a href="index.vspx?tag=xml" rel="tag" style="display:none;">xml</a><a href="index.vspx?tag=rss" rel="tag" style="display:none;">rss</a><a href="index.vspx?tag=syndication" rel="tag" style="display:none;">syndication</a>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-07-07#201">
  <rss:title>Tim O&#39;Reilly about network aware software</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-07-07T20:51:35Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Tim O&#39;Reilly about network aware software Tim O&#39;Reilly wrote some thoughts about network aware software. Good summary and nice ideas, why not only blogs should be net-aware (and where even blogs can be improved ;) ) &quot;For the desktop, my personal vision is to see existing software instrumented to become increasingly web aware. It seems that Apple are doing a good job with this. (What does web aware mean for me? Being able to grok URIs, speaking WebDAV, and using open standard data formats.)&quot; -- Edd Dumbill [via Bitflux Blog] I agree, but you do have to add Open Data Access formats (such as ODBC and to some degree JDBC) to this mix, otherwise you will need to create data for Open Standard Data Formats from scratch (tough for any enterprise irrespective of size). Tim O&#39;Reilly added the following items to Edd&#39;s list: Rendezvous-like functionality for automatic discovery of and potential synchronization with other instances of the application on other computers. Apple is showing the power of this idea with iChat and iTunes, but it really could be applied in so many other places. For example, if every PIM supported this functionality, we could have the equivalent of &quot;phonester&quot; where you could automatically ask peers for contact information. Of course, that leads to guideline 2. Another application is discovery of ODBC data sources, and database servers. Rendezvous can also simplify security and administration of data sources accessible by either one of these standard data access mechanisms. It can also apply to XML databases and data sources exposed by XML Databases. If you assume ad-hoc networking, you have to automatically define levels of access. 
I&#39;ve always thought that the old Unix ugo (user, group, other) three-level permission system was simple and elegant, and if you replace the somewhat arbitrary &quot;group&quot; with &quot;on my buddy list&quot;, you get something quite powerful. Which leads me to... Buddy lists ought to be supported as a standard feature of many apps, and in a consistent way. What&#39;s more, our address books really ought to make it easy to indicate who is in a &quot;buddy list&quot; and support numerous overlapping lists for different purposes. Every application ought to expose some version of its data as an XML feed via some well-defined and standard access mechanism. It strikes me that one of the really big wins that fueled the early web was a simple naming scheme: you could go to a site called www.foo.com, and you&#39;d find a web server there. While it wasn&#39;t required, it made web addresses eminently guessable. We missed the opportunity for xml.foo.com to mean &quot;this is where you get the data feed&quot; but it&#39;s probably still possible to come up with a simple, consistent naming scheme. And of course, if we can do it for web sites, we also need to think about how to do it for local applications, since... This is the very point I continue to make about Internet Points of Presence being actual data access points; in short, these end points should be served by database server processes. This is the very basis of Virtuoso; the inevitability of this realization remains the underpinning of this product. There are other products out there that have some sense of this vision too, but there is a little snag (at least so far in my research efforts), and that is the tendency to create a dedicated independent server per protocol (an ultimate integration, administration, and maintenance nightmare). We ought to be able to have the expectation that all applications, whether local or remote (web) will be set up for two-way interactions. 
That is, they can be either a source or sink of online data. So, for example, the natural complement to amazon&#39;s web services data feeds is data input (for example, the ability to comment on a book on your local blog, and syndicate the review via RSS to amazon&#39;s detail page for the book.) And that leads to: We really need to understand who owns what, and come up with mechanisms that protect the legitimate rights of individuals and businesses to their own data, while creating the &quot;liquidity&quot; and free movement of data that will fuel the next great revolution in computer functionality. (I&#39;m doing a panel on this subject at next week&#39;s Open Source Convention, entitled &quot;We Need a Bill of Rights for Web Services.&quot;) We need easy gateways between different application domains. I was recently in Finland at a Nokia retreat, and we used camera-enabled cell phones to create a mobile photoblog. That was great. But even more exciting was the ease with which I could send a photo from the phone not just to another phone but also to an email address. This is the functionality that enabled the blog gateway, but it also made it trivial to send photos home to my family and friends. Similarly, I often blog things that I hear on mailing lists, and read many web sites via screen-scraping enabled email lists. It would be nice to have cross-application gateways be a routine part of software, rather than something that has to be hacked on after the fact. The wish list is pretty much a clear articulation of key items that should matter most to decision makers (CTOs and CIOs), in particular those who continue to wrestle with the identification and isolation of relevant components for their enterprise architectures.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p><a href="http://blog.bitflux.ch/p1077.html">Tim O'Reilly about network aware software</a> </p>
<p>Tim O'Reilly wrote some thoughts about network aware software. Good summary and nice ideas, why not only blogs should be net-aware (and where even blogs can be improved ;) ) </p>
<blockquote dir="ltr" style="MARGIN-RIGHT: 0px">
<div align="left">"<i>For the desktop, my personal vision is to see existing software instrumented to become increasingly web aware. It seems that Apple are doing a good job with this. (What does web aware mean for me? Being able to grok URIs, speaking WebDAV, and using open standard data formats.)</i>" -- <strong>Edd Dumbill</strong> </div>
<div align="left"></div>
<div align="left">[via <a href="http://blog.bitflux.ch/">Bitflux Blog</a>]</div></blockquote>
<div align="left">I agree, but you do have to add Open Data Access formats (such as ODBC and to some degree JDBC) to this mix, otherwise you will need to create data for Open Standard Data Formats from scratch (tough for any enterprise irrespective of size).</div>
<div align="left"></div>
<div align="left">Tim O'Reilly added the following items to Edd's list:</div>
<div align="left">
<ul>
<li>
<p>Rendezvous-like functionality for automatic discovery of and potential synchronization with other instances of the application on other computers. Apple is showing the power of this idea with iChat and iTunes, but it really could be applied in so many other places. For example, if every PIM supported this functionality, we could have the equivalent of "phonester" where you could automatically ask peers for contact information. Of course, that leads to guideline 2. </p></li></ul></div>
<p>Another application is discovery of <a href="http://www.openlinksw.com/info/docs/uda50/mt/features.html#features">ODBC data sources</a>, and database servers. Rendezvous can also simplify security and administration of data sources accessible by either one of these standard data access mechanisms. It can also apply to XML databases and data sources exposed by <a href="http://www.openlinksw.com/virtuoso/whatis.htm">XML Databases</a>.</p>
<p></p>
<p></p>
<ul>
<li>If you assume ad-hoc networking, you have to automatically define levels of access. I've always thought that the old Unix ugo (user, group, other) three-level permission system was simple and elegant, and if you replace the somewhat arbitrary "group" with "on my buddy list", you get something quite powerful. Which leads me to... 
<p></p>
<p></p></li>
<ul>
<li>Buddy lists ought to be supported as a standard feature of many apps, and in a consistent way. What's more, our address books really ought to make it easy to indicate who is in a "buddy list" and support numerous overlapping lists for different purposes. <br></li></ul>
<li>Every application ought to expose some version of its data as an XML feed via some well-defined and standard access mechanism. It strikes me that one of the really big wins that fueled the early web was a simple naming scheme: you could go to a site called www.foo.com, and you'd find a web server there. While it wasn't required, it made web addresses eminently guessable. We missed the opportunity for xml.foo.com to mean "this is where you get the data feed" but it's probably still possible to come up with a simple, consistent naming scheme. And of course, if we can do it for web sites, we also need to think about how to do it for local applications, since... </li></ul>
<p>This is the very point I continue to make about Internet Points of Presence being actual data access points; in short, these end points should be served by database server processes. This is the very basis of <a href="http://www.openlinksw.com/virtuoso">Virtuoso</a>; the inevitability of this realization remains the underpinning of this product. There are other products out there that have some sense of this vision too, but there is a little snag (at least so far in my research efforts), and that is the tendency to create a dedicated independent server per protocol (an ultimate integration, administration, and maintenance nightmare).</p>
<ul>
<li>We ought to be able to have the expectation that all applications, whether local or remote (web) will be set up for two-way interactions. That is, they can be either a source or sink of online data. So, for example, the natural complement to amazon's web services data feeds is data input (for example, the ability to comment on a book on your local blog, and syndicate the review via RSS to amazon's detail page for the book.) And that leads to: 
<p></p>
<p></p></li>
<li>We really need to understand who owns what, and come up with mechanisms that protect the legitimate rights of individuals and businesses to their own data, while creating the "liquidity" and free movement of data that will fuel the next great revolution in computer functionality. (I'm doing a panel on this subject at next week's Open Source Convention, entitled "<a href="http://conferences.oreillynet.com/cs/os2003/view/e_sess/4526">We Need a Bill of Rights for Web Services</a>.") 
<p></p>
<p></p></li>
<li>We need easy gateways between different application domains. I was recently in Finland at a Nokia retreat, and we used camera-enabled cell phones to create a mobile photoblog. That was great. But even more exciting was the ease with which I could send a photo from the phone not just to another phone but also to an email address. This is the functionality that enabled the blog gateway, but it also made it trivial to send photos home to my family and friends. Similarly, I often blog things that I hear on mailing lists, and read many web sites via screen-scraping enabled email lists. It would be nice to have cross-application gateways be a routine part of software, rather than something that has to be hacked on after the fact.</li></ul>
<div align="left">The wish list is pretty much a clear articulation of key items that should matter most to decision makers (CTOs and CIOs), in particular those who continue to wrestle with the identification and isolation of relevant components for their enterprise architectures. </div>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-06-25#182">
  <rss:title>Lack Of Internet Skills A Barrier To Progress At Work</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-06-25T13:27:02Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Lack Of Internet Skills A Barrier To Progress At Work We need to get with the program: technology is no silver bullet. We have brains for a reason; we simply need to exercise the brain muscle (an activity that has been in rapid decline). The piece below pretty much sums up this sentiment: Lack Of Internet Skills A Barrier To Progress At Work I would guess this really depends on what your job entails, but a new survey has found that many people who lack internet &quot;skills&quot; feel that it has held them back at work. There are plenty of jobs where I would assume it would be a requirement that you know how to use the internet, while there are plenty of others where it shouldn&#39;t matter one way or the other. Also, I imagine this problem will begin to decrease over time as a new generation of workers shows up who were brought up on the internet. Of course, then we&#39;ll find out that a lack of &quot;mobile phone text messaging&quot; or some other random tech skill will be holding people back at work. These are all skills that can be picked up with a little bit of effort. If people think they need them to advance in their job, isn&#39;t it their responsibility to learn these skills? You make yourself employable by keeping up-to-date. [via Techdirt] I say, &quot;Get with the Program!&quot;.  </dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<P><A href="http://techdirt.com/articles/20030625/0124245.shtml">Lack Of Internet Skills A Barrier To Progress At Work</A></P>
<P>We need to get with the program: technology is no silver bullet. We have brains for a reason, and we simply need to exercise the brain muscle (an activity that has been in rapid decline). The piece below pretty much sums up this sentiment:</P>
<BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px">
<BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px">
<P><A href="http://techdirt.com/articles/20030625/0124245.shtml">Lack Of Internet Skills A Barrier To Progress At Work</A> I would guess this really depends on what your job entails, but a new survey has found that many people who lack internet "skills" feel that it has <A href="http://www.ananova.com/news/story/sm_793495.html">held them back at work</A>. There are plenty of jobs where I would assume it would be a requirement that you know how to use the internet, while there are plenty of others where it shouldn't matter one way or the other. Also, I imagine this problem will begin to decrease over time as a new generation of workers shows up who were brought up on the internet. Of course, then we'll find out that a lack of "mobile phone text messaging" or some other random tech skill will be holding people back at work. These are all skills that can be picked up with a little bit of effort. If people think they need them to advance in their job, isn't it their responsibility to learn these skills? You make yourself employable by keeping up-to-date. [via <A href="http://www.techdirt.com/">Techdirt</A>]</P></BLOCKQUOTE></BLOCKQUOTE>
<P dir=ltr>I say, "Get with the Program!"</P>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?date=2003-06-17#279">
  <rss:title>Ingres - A Forgotten Database, the untold story</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2003-06-17T11:18:57Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Ingres - A Forgotten Database The Untold Story Ingres (technically, Advantage Ingres Enterprise) is, arguably, the forgotten database. There used to be five major databases: Oracle, DB2, Sybase, Informix and Ingres. Then along came Microsoft and, if you listened to most press comment (or the lack of it), you would think that there were only two of these left, plus SQL Server. [From IT-Director] Oracle, Microsoft, and IBM would certainly like the illusion of a three-horse race, as this is the only way they can induce Ingres, Informix, and Sybase users to jump ship, even though database migrations are by far the most risk-prone and problematic aspects of any IT infrastructure. Here is the interesting logic from the self-made big three: if you want to take advantage of new paradigms and technologies such as XML, Web Services, and anything else in the pipeline, you have to move all your data out of these databases and then get all the mission-critical applications re-associated with one of the big three&#39;s databases. And by the way, when you do so, it is advisable that you use native interfaces (so that sometime in the future you have no chance whatsoever of repeating this folly at their expense). The simple fact of the matter (which the self-made big three do not want you to know) is that you can put ODBC, JDBC, or even platform-specific data access APIs such as OLE DB and ADO.NET atop any of these databases, and then explore and exploit the benefits of new technologies and paradigms, as long as the tool pool supports one or more of these standards. Unfortunately, the no-brainer above appears to be the more difficult of the choices before decision makers. In other words, many would rather dig themselves into a deeper hole (unknowingly, I can only presume) that ultimately leads to technology lock-in.
The biggest challenge before any RDBMS-based infrastructure today isn&#39;t which of the self-made big three to migrate to wholesale, but rather how to make progressive use of the pool of disparate applications and application databases that proliferate across the enterprise. This is another way of understanding the burgeoning market for Virtual Databases, which in my opinion present the new frontier in database technology.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<P><A href="http://www.it-director.com/article.php?articleid=10951">Ingres - A Forgotten Database The Untold Story</A></P>
<P><EM>Ingres (technically, Advantage Ingres Enterprise) is, arguably, the forgotten database. There used to be five major databases: Oracle, DB2, Sybase, Informix and Ingres. Then along came Microsoft and, if you listened to most press comment (or the lack of it), you would think that there were only two of these left, plus SQL Server</EM>. [From <A href="http://www.it-director.com/article.php?articleid=10951">IT-Director</A>]</P>
<BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px">
<P><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Oracle, Microsoft, and IBM would certainly like the illusion of a three-horse race, as this is the only way they can induce Ingres, Informix, and Sybase users to jump ship, even though database migrations are by far the most risk-prone and problematic aspects of any IT infrastructure.</SPAN></P>
<P><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Here is the interesting logic from the self-made big three: if you want to take advantage of new paradigms and technologies such as XML, Web Services, and anything else in the pipeline, you have to move all your data out of these databases and then get all the mission-critical applications re-associated with one of the big three's databases. And by the way, when you do so, it is advisable that you use native interfaces (so that sometime in the future you have no chance whatsoever of repeating this folly at their expense).</SPAN></P>
<P><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The simple fact of the matter (which the self-made big three do not want you to know) is that you can put ODBC, JDBC, or even platform-specific data access APIs such as OLE DB and ADO.NET atop any of these databases, and then explore and exploit the benefits of new technologies and paradigms, as long as the tool pool supports one or more of these standards.</SPAN></P>
<P><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Unfortunately, the no-brainer above appears to be the more difficult of the choices before decision makers. In other words, many would rather dig themselves into a deeper hole (unknowingly, I can only presume) that ultimately leads to technology lock-in.</SPAN></P>
<P><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The biggest challenge before any RDBMS-based infrastructure today isn't which of the self-made big three to migrate to wholesale, but rather how to make progressive use of the pool of disparate applications and application databases that proliferate across the enterprise.</SPAN></P>
<P><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">This is another way of understanding the burgeoning market for Virtual Databases, which in my opinion present the new frontier in database technology.</SPAN></P>
</BLOCKQUOTE>]]></content:encoded>
 </rss:item>
</rdf:RDF>