<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>

<title>Orri Erling&#39;s Weblog</title><link>http://www.openlinksw.com/weblog/oerling/</link><description /><managingEditor>oerling@openlinksw.com</managingEditor><pubDate>Mon, 23 Nov 2009 11:33:52 GMT</pubDate><generator>Virtuoso Universal Server 05.12.3041</generator><webMaster>oerling@openlinksw.com</webMaster><image><title>Orri Erling&#39;s Weblog</title><url>http://www.openlinksw.com/weblog/public/images/vbloglogo.gif</url><link>http://www.openlinksw.com/weblog/oerling/</link><description /><width>88</width><height>31</height></image>
<item><title>Social Web Camp (#5 of 5)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2009-04-30#1554</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1554#comments</comments><pubDate>Thu, 30 Apr 2009 16:14:02 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-04-30T12:51:49-04:00</n0:modified><description>&lt;p&gt;(Last of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0xd28c860&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.)

&lt;/p&gt;
&lt;p&gt;The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.&lt;/p&gt;

&lt;p&gt;In my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks; one can have a lot of breadth of scope, but this too depends on not getting sidetracked by constant &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x10019a70&quot;&gt;information&lt;/a&gt; overload.&lt;/p&gt;

&lt;p&gt;Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus, and this cost will have to be managed somehow, for example by automation or outsourcing. But if social media consists only of automated fronts tweeting and retweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.&lt;/p&gt;

&lt;p&gt;There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don&amp;#39;t get lost in it.&lt;/p&gt;

&lt;p&gt;There is &lt;a href=&quot;https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/index.html&quot; id=&quot;link-id170cb010&quot;&gt;a CIA memorandum about how analysts misinterpret data and see what they want to see&lt;/a&gt;. This is a relevant resource for understanding some psychology of perception and memory. With information overload, largely driven by user-generated content, interpreting fragmented and variously biased real-time information is a task not only for the analyst but for everyone who needs to function intelligently in cyber-social space.&lt;/p&gt;

&lt;p&gt;I participated in discussions on security and privacy and on mobile social networks and context.&lt;/p&gt;

&lt;p&gt;For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.&lt;/p&gt;

&lt;p&gt;There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies &lt;a id=&quot;link-id14aaff90&quot;&gt;à la&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x10b058d0&quot;&gt;SQL&lt;/a&gt; do not work well when schema is fluid and end-users can&amp;#39;t be expected to formulate or understand these. Remember &lt;a href=&quot;http://dbpedia.org/resource/Ted_Nelson&quot; id=&quot;link-id0x145b3070&quot;&gt;Ted Nelson&lt;/a&gt;? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic&amp;#39;s Data Patrol should be a part of the social web infrastructure of the future.&lt;/p&gt;

&lt;p&gt;People at MIT have developed AIR (Accountability In &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x10dec8f8&quot;&gt;RDF&lt;/a&gt;) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we look at the history of secrets at all, one seldom hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.&lt;/p&gt;

&lt;p&gt;In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.&lt;/p&gt;

&lt;p&gt;For mobility, there was some talk about the context of usage. Again, this is difficult. In some contexts, one would disclose one&amp;#39;s location only at the granularity of the city; in others, one would say which conference room one is in.&lt;/p&gt;

&lt;p&gt;Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.&lt;/p&gt;

&lt;p&gt;There is a thin line between convenience and having IT infrastructure rule one&amp;#39;s life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-&lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0xd7b1808&quot;&gt;knowledge&lt;/a&gt;, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.&lt;/p&gt;</description></item><item><title>Beyond Applications - Introducing the Planetary Datasphere (Part 2)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2009-03-25#1537</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1537#comments</comments><pubDate>Wed, 25 Mar 2009 15:50:56 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-03-25T12:31:55-04:00</n0:modified><description>&lt;p&gt;
&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1535&quot; id=&quot;link-id155e3bd0&quot;&gt;We have looked at the general implications of the DataSphere&lt;/a&gt;, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.&lt;/p&gt;

&lt;h2&gt;Application Hosting&lt;/h2&gt;

&lt;p&gt;For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.&lt;/p&gt;

&lt;p&gt;Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.&lt;/p&gt;

&lt;h2&gt;What of Proprietary &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x13b5b4a0&quot;&gt;Data&lt;/a&gt; and its Security?&lt;/h2&gt;

&lt;p&gt;Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one&amp;#39;s own data. To take an analogy, if you have a bunker with closed-circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x24db80e0&quot;&gt;RDBMS&lt;/a&gt; security, the best approach is to map the RDBMS to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x24ea7c40&quot;&gt;RDF&lt;/a&gt; on the fly, always running all requests through the RDBMS. This implicitly preserves any policy-based or label-based security schemes.&lt;/p&gt;

&lt;h2&gt;What of Individual Privacy on the Open Web?&lt;/h2&gt;

&lt;p&gt;The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with &lt;code&gt;https://&lt;/code&gt; URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x24aaccc0&quot;&gt;SQL&lt;/a&gt;, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user&amp;#39;s department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.&lt;/p&gt;
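
&lt;p&gt;As a rough illustration of such behind-the-scenes rewriting, here is a sketch in generic SQL. The &lt;code&gt;payroll&lt;/code&gt; and &lt;code&gt;employee&lt;/code&gt; tables and their columns are hypothetical, and the mechanism for declaring the policy itself varies by DBMS.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- What the user writes:
SELECT name, salary
  FROM payroll ;

-- What the DBMS might effectively run once a department-scoped
-- policy is in force (hypothetical schema):
SELECT p.name, p.salary
  FROM payroll p
 WHERE p.dept_id = ( SELECT e.dept_id
                       FROM employee e
                      WHERE e.login = CURRENT_USER ) ;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The extra condition is written against specific tables and columns, which presupposes that the schema is fixed and known in advance.&lt;/p&gt;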

&lt;p&gt;I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer&amp;#39;s capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.&lt;/p&gt;

&lt;p&gt;Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data &lt;a href=&quot;http://en.wikipedia.org/wiki/Data_Spaces&quot; id=&quot;link-id0x2396c058&quot;&gt;Space&lt;/a&gt; can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.&lt;/p&gt;
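
&lt;p&gt;To make the container idea concrete, here is a sketch in SPARQL, prefixed with the &lt;code&gt;sparql&lt;/code&gt; keyword as Virtuoso expects from an SQL client. The graph IRI is made up for illustration, and using one graph per photo album is an assumption, not a requirement. The query names the graph it reads, so an access check needs to consider only the graph IRI and the identity of the caller, not each triple.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   PREFIX  foaf:  &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
   SELECT ?photo ?person 
     FROM &amp;lt;http://example.org/alice/albums/summer2008&amp;gt;   -- hypothetical album graph
    WHERE { ?photo  foaf:depicts  ?person } ;
&lt;/pre&gt;&lt;/blockquote&gt;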

&lt;p&gt;It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner&amp;#39;s social network. Defining such policies is typically a task for the application developer.&lt;/p&gt;

&lt;p&gt;The difference between the Document Web and the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x238a0098&quot;&gt;Linked Data&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x23882280&quot;&gt;Web&lt;/a&gt; is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.&lt;/p&gt;

&lt;p&gt;The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one&amp;#39;s disclosed &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x2388e458&quot;&gt;information&lt;/a&gt; using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.&lt;/p&gt;

&lt;h2&gt;Federation vs. Centralization&lt;/h2&gt;

&lt;p&gt;On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. For technical reasons, a single database cluster will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds of billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight, and applications more or less float on top of all of this.&lt;/p&gt;

&lt;p&gt;A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.&lt;/p&gt;

&lt;p&gt;The cost of hosting a trillion triples will vary radically depending on what throughput is promised. This may result in pricing per service level, a bit like ISP pricing varies with promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five million joins per second, including on infrequently accessed data.&lt;/p&gt;

&lt;p&gt;Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world&amp;#39;s spectrum of query needs.&lt;/p&gt;

&lt;h2&gt;What is the Cost of Schema-Last?&lt;/h2&gt;

&lt;p&gt;The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.&lt;/p&gt;

&lt;p&gt;We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.&lt;/p&gt;

&lt;p&gt;While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.&lt;/p&gt;

&lt;p&gt;Historically, our industry has been driven by two phenomena:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
  &lt;b&gt;New PCs every 2 years.&lt;/b&gt; To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.&lt;/li&gt;

&lt;li&gt;
  &lt;b&gt;Agility, or &lt;i&gt;ad hoc&lt;/i&gt; over planned.&lt;/b&gt; The reason the RDBMS won over &lt;a href=&quot;http://dbpedia.org/resource/CODASYL&quot; id=&quot;link-id0x13b23460&quot;&gt;CODASYL&lt;/a&gt; network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.&lt;/p&gt;

&lt;p&gt;Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It&amp;#39;s in the air.&lt;/p&gt;

&lt;h2&gt;DataSphere Precursors&lt;/h2&gt;

&lt;p&gt;Right now, we have the &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x236a9be8&quot;&gt;Linked Open Data&lt;/a&gt; movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x24db8350&quot;&gt;DBpedia&lt;/a&gt;, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.&lt;/p&gt;

&lt;p&gt;We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.&lt;/p&gt;

&lt;p&gt;We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.&lt;/p&gt;

&lt;p&gt;Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.&lt;/p&gt;

&lt;h2&gt;Databases and Servers&lt;/h2&gt;

&lt;p&gt;Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.&lt;/p&gt;

&lt;p&gt;This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards.&lt;/p&gt;
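
&lt;p&gt;The arithmetic behind that last figure can be checked at any SQL prompt that accepts a &lt;code&gt;SELECT&lt;/code&gt; without a &lt;code&gt;FROM&lt;/code&gt; clause, such as the Virtuoso &lt;code&gt;isql&lt;/code&gt; utility. This sketch uses only the component prices and memory sizes quoted above, and assumes 1 TB = 1024 GB; it leaves out racks, switches, and power.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- Low-end boards needed for 1 TB of RAM at 8 GB per board
SELECT 1024 / 8 ;            -- 128

-- Component cost at about $700 per board
SELECT (1024 / 8) * 700 ;    -- 89600, i.e., under $100K

-- Billions of triples that fit at 32 GB per billion triples
SELECT 1024 / 32 ;           -- 32
&lt;/pre&gt;&lt;/blockquote&gt;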

&lt;p&gt;The workload is all about large joins across partitions. The queries parallelize well, so using the largest and most expensive machines as building blocks is not cost-efficient. Having absolutely everything in RAM is also not cost-efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSD for the database and disk for backup.&lt;/p&gt;

&lt;p&gt;With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.&lt;/p&gt;

&lt;p&gt;For operating such systems, scaling on demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these things are complex and quickly become intractable for anyone without an in-depth understanding of the field.&lt;/p&gt;

&lt;p&gt;The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.&lt;/p&gt;

&lt;h2&gt;Conclusions and Next Steps&lt;/h2&gt;

&lt;p&gt;The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.&lt;/p&gt;

&lt;p&gt;Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, &lt;a href=&quot;http://dbpedia.org/resource/Cyc&quot; id=&quot;link-id0x2396a038&quot;&gt;OpenCYC&lt;/a&gt;, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.&lt;/p&gt;

&lt;p&gt;The Linked Data Web has its origins in &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x13ea7110&quot;&gt;Semantic Web&lt;/a&gt; research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x2388e368&quot;&gt;blog&lt;/a&gt;, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges, however, are not at all what is discussed at Semantic Web conferences. They have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic; they are hardcore database matters, concerning automatic provisioning and the like.&lt;/p&gt;

&lt;p&gt;It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: &lt;i&gt;&lt;b&gt;go for it&lt;/b&gt;&lt;/i&gt;. Stay tuned for more announcements, as we near production with our next generation of software.&lt;/p&gt;


&lt;h2&gt;Related&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1535&quot; id=&quot;link-id14e02bb0&quot;&gt;Beyond Applications - Introducing the Planetary Datasphere (Part 1)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1442&quot; id=&quot;link-id117dc518&quot;&gt;Serendipitous Discovery Quotient (SDQ)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1534&quot; id=&quot;link-id15c52410&quot;&gt;How Linked Data will change Advertising&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?id=1519&quot; id=&quot;link-id11e93658&quot;&gt;The Time for RDBMS Primacy Downgrade is Nigh!&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/blog/~kidehen/?tag=DataSpace&quot; id=&quot;link-id1491a588&quot;&gt;Data Spaces&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Virtuoso RDF:  A Getting Started Guide for the Developer</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1504#comments</comments><pubDate>Wed, 17 Dec 2008 12:31:34 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-17T12:41:21.000001-05:00</n0:modified><description>
&lt;p&gt;It is a long standing promise of mine to dispel the false impression that using &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot; id=&quot;link-id113506d0&quot;&gt;Virtuoso&lt;/a&gt; to work with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id115d9528&quot;&gt;RDF&lt;/a&gt; is complicated.&lt;/p&gt;

&lt;p&gt;The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.&lt;/p&gt;

&lt;p&gt;You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under emacs (&lt;code&gt;M-x shell&lt;/code&gt;) works best, since it makes it convenient to cut and paste between the shell and files.  The open source &lt;code&gt;isql&lt;/code&gt; utility should also have command line editing.&lt;/p&gt;

&lt;p&gt;To get started, cd into &lt;code&gt;binsrc/tests/lubm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To verify that the installation works, run:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;./test_server.sh virtuoso-t&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This will test the server with the LUBM queries.  This should report 45 tests passed.  After this we will do the tests step-by-step.&lt;/p&gt;

&lt;h2&gt;Loading the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id10f7bd90&quot;&gt;Data&lt;/a&gt;
&lt;/h2&gt; 

&lt;p&gt;The file &lt;code&gt;lubm-load.sql&lt;/code&gt; contains the commands for loading the LUBM single university qualification database.&lt;/p&gt;

&lt;p&gt;The data files themselves are in &lt;code&gt;lubm_8000&lt;/code&gt;, 15 files in RDF/XML.&lt;/p&gt;

&lt;p&gt;There is also a little ontology called &lt;code&gt;inf.nt&lt;/code&gt;.  This declares the subclass and subproperty relations used in the benchmark.&lt;/p&gt;

&lt;p&gt;So now let&amp;#39;s go through this procedure.&lt;/p&gt;

&lt;p&gt;Start the server:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;$ virtuoso-t -f &amp;amp;
&lt;/pre&gt;&lt;/blockquote&gt;

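&lt;p&gt;The &lt;code&gt;-f&lt;/code&gt; flag starts the server in foreground mode, so it does not detach as a daemon, and the trailing &lt;code&gt;&amp;amp;&lt;/code&gt; puts the process in the background of the shell.&lt;/p&gt;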

&lt;p&gt;Now we connect to it with the isql utility.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;$ isql 1111 dba dba 
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This gives a &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.  The default username and password are both &lt;code&gt;dba&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When a command is &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1176ce70&quot;&gt;SQL&lt;/a&gt;, it is entered directly.  If it is &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id156df468&quot;&gt;SPARQL&lt;/a&gt;, it is prefixed with the keyword &lt;code&gt;sparql&lt;/code&gt;.  This is how all the SQL clients work.  Any SQL client, such as any &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id152d0a00&quot;&gt;ODBC&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id157ad6a0&quot;&gt;JDBC&lt;/a&gt; application, can use SPARQL if the SQL string starts with this keyword.&lt;/p&gt;
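
&lt;p&gt;For example, the following two statements can be typed at the same prompt. They are shown only as an illustration and are not part of the LUBM scripts. The first is plain SQL against &lt;code&gt;DB.DBA.RDF_QUAD&lt;/code&gt;, the table in which Virtuoso stores triples; the second is SPARQL over the same data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;SELECT COUNT(*) 
  FROM DB.DBA.RDF_QUAD ;

sparql 
   SELECT COUNT(*) 
    WHERE { ?s ?p ?o } ;
&lt;/pre&gt;&lt;/blockquote&gt;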

&lt;p&gt;The &lt;code&gt;lubm-load.sql&lt;/code&gt; file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, &lt;code&gt;DB..RDF_LOAD_RDFXML&lt;/code&gt;, for each file in a directory.&lt;/p&gt;
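
&lt;p&gt;A procedure of this kind might look roughly like the sketch below. This is not copied from &lt;code&gt;lubm-load.sql&lt;/code&gt;, and the actual file may differ; in particular, the base URI passed to the loader is a placeholder chosen for illustration, and the sketch assumes the directory contains only the data files.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;create procedure load_lubm (in dir varchar)
{
  declare files any;
  declare i integer;
  -- a second argument of 1 asks sys_dirlist for files rather than subdirectories
  files := sys_dirlist (dir, 1);
  for (i := 0; i &amp;lt; length (files); i := i + 1)
    {
      -- arguments: text to load, base URI, target graph
      DB..RDF_LOAD_RDFXML ( file_to_string ( dir || aref (files, i) ), 
                            &amp;#39;http://example.org/lubm/&amp;#39;,   -- placeholder base URI (assumption)
                            &amp;#39;lubm&amp;#39; 
                          ) ;
    }
}
;
&lt;/pre&gt;&lt;/blockquote&gt;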

&lt;p&gt;Next it calls this function for the &lt;code&gt;lubm_8000&lt;/code&gt; directory under the server&amp;#39;s working directory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   CLEAR GRAPH &amp;lt;lubm&amp;gt;;

sparql 
   CLEAR GRAPH &amp;lt;inf&amp;gt;;

load_lubm ( server_root() || &amp;#39;/lubm_8000/&amp;#39; );
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then it verifies that the right number of triples is found in the &amp;lt;lubm&amp;gt; graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   SELECT COUNT(*) 
     FROM &amp;lt;lubm&amp;gt; 
    WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.&lt;/p&gt;

&lt;p&gt;Then it adds some implied &lt;code&gt;subOrganizationOf&lt;/code&gt; triples.  This is part of setting up the LUBM test database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   PREFIX  ub:  &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   INSERT 
      INTO GRAPH &amp;lt;lubm&amp;gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &amp;lt;lubm&amp;gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then it loads the ontology file, &lt;code&gt;inf.nt&lt;/code&gt;, using the Turtle load function, &lt;code&gt;DB.DBA.TTLP&lt;/code&gt;.  The arguments of the function are the text to load, the default namespace prefix, and the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id15835550&quot;&gt;URI&lt;/a&gt; of the target graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;DB.DBA.TTLP ( file_to_string ( &amp;#39;inf.nt&amp;#39; ), 
              &amp;#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&amp;#39;, 
              &amp;#39;inf&amp;#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &amp;lt;inf&amp;gt; 
    WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then we declare that the triples in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph can be used for inference at run time.  To enable this, a SPARQL query will declare that it uses the &lt;code&gt;&amp;#39;inft&amp;#39;&lt;/code&gt; rule set.  Otherwise this has no effect.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;rdfs_rule_set (&amp;#39;inft&amp;#39;, &amp;#39;inf&amp;#39;);
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The last statement is just a log checkpoint, which finalizes the work and truncates the transaction log.  The server would also eventually do this in its own time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;checkpoint;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Now we are ready for querying.&lt;/p&gt;

&lt;h2&gt;Querying the Data&lt;/h2&gt; 

&lt;p&gt;The queries are given in 3 different versions: The first file, &lt;code&gt;lubm.sql&lt;/code&gt;, has the queries with most inference open coded as &lt;code&gt;UNIONs&lt;/code&gt;. The second file, &lt;code&gt;lubm-inf.sql&lt;/code&gt;, has the inference performed at run time using the ontology &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id1109faf0&quot;&gt;information&lt;/a&gt; in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph we just loaded.  The last, &lt;code&gt;lubm-phys.sql&lt;/code&gt;, relies on having the entailed triples physically present in the &lt;code&gt;&amp;lt;lubm&amp;gt;&lt;/code&gt; graph.  These entailed triples are inserted by the SPARUL commands in the &lt;code&gt;lubm-cp.sql&lt;/code&gt; file.&lt;/p&gt;
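
&lt;p&gt;To illustrate the difference between the first two styles, here is a hedged sketch rather than a query copied from the files.  It assumes the &lt;code&gt;inft&lt;/code&gt; rule set declared above and uses the LUBM classes &lt;code&gt;ub:Professor&lt;/code&gt;, &lt;code&gt;ub:FullProfessor&lt;/code&gt;, &lt;code&gt;ub:AssociateProfessor&lt;/code&gt;, and &lt;code&gt;ub:AssistantProfessor&lt;/code&gt;; the three subclasses listed are not claimed to be the complete set.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- Open coded: each subclass is spelled out as a UNION branch 
sparql 
   PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   SELECT COUNT (*) 
     FROM &amp;lt;lubm&amp;gt; 
    WHERE { { ?x a ub:FullProfessor } 
            UNION { ?x a ub:AssociateProfessor } 
            UNION { ?x a ub:AssistantProfessor } 
          } ;

-- Run-time inference: ask for the superclass and name the rule set 
sparql 
   DEFINE input:inference &amp;quot;inft&amp;quot; 
   PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   SELECT COUNT (*) 
     FROM &amp;lt;lubm&amp;gt; 
    WHERE { ?x a ub:Professor } ;
&lt;/pre&gt;&lt;/blockquote&gt;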

&lt;p&gt;If you wish to run all the commands in a SQL file, you can type &lt;code&gt;load &amp;lt;filename&amp;gt;;&lt;/code&gt; (e.g., &lt;code&gt;load lubm-cp.sql;&lt;/code&gt;) at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt. If you wish to try individual statements, you can paste them to the command line.&lt;/p&gt;

&lt;p&gt;For example: &lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;SQL&amp;gt; sparql 
   PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   SELECT * 
     FROM &amp;lt;lubm&amp;gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &amp;lt;http://www.Department0.University0.edu/AssistantProfessor0&amp;gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
&lt;/pre&gt;&lt;/blockquote&gt;


&lt;p&gt;To stop the server, simply type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.&lt;/p&gt;

&lt;p&gt;If you wish to use a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id11384668&quot;&gt;SPARQL protocol&lt;/a&gt; end point, just enable the HTTP listener.  This is done by adding a stanza like --&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;â to the end of the &lt;code&gt;virtuoso.ini&lt;/code&gt; file in the &lt;code&gt;lubm&lt;/code&gt; directory.  Then shutdown and restart (type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt and then &lt;code&gt;virtuoso-t -f &amp;amp;&lt;/code&gt; at the shell prompt).&lt;/p&gt;

&lt;p&gt;Now you can connect to the end point with a web browser.  The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id113d02d8&quot;&gt;URL&lt;/a&gt; is &lt;code&gt;http://localhost:8421/sparql&lt;/code&gt;. Without parameters, this will show a human readable form.  With parameters, this will execute SPARQL.&lt;/p&gt;

&lt;p&gt;We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id142d0ba0&quot;&gt;PHP&lt;/a&gt;, using the PHP ODBC interface.&lt;/p&gt;

&lt;p&gt;To see how to use &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id117074f0&quot;&gt;Jena&lt;/a&gt; or &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id1103c9b0&quot;&gt;Sesame&lt;/a&gt; with Virtuoso, look at &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html&quot; id=&quot;link-id15488ce8&quot;&gt;Native RDF Storage Providers&lt;/a&gt;. To see how RDF data types are supported, see &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf&quot; id=&quot;link-id15784a40&quot;&gt;Extension datatype for RDF&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do &lt;code&gt;log_enable (2);&lt;/code&gt; before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/&quot; id=&quot;link-id111410f0&quot;&gt;documentation&lt;/a&gt; for more.&lt;/p&gt;</description></item><item><title>&quot;E Pluribus Unum&quot;, or &quot;Inversely Functional Identity&quot;, or &quot;Smooshing Without the Stickiness&quot; (re-updated)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-12-16#1498</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1498#comments</comments><pubDate>Tue, 16 Dec 2008 14:14:43 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-16T15:01:30-05:00</n0:modified><description>&lt;p&gt;What a terrible word, smooshing...  I have understood it to mean that when you have two names for one thing, you give each all the attributes of the other.  This smooshes them together, makes them interchangeable.&lt;/p&gt;

&lt;p&gt;This is complex, so I will begin with the point and the interested may read on for the details and implications.  Starting with the soon-to-be-released version 6, &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id15718cb8&quot;&gt;Virtuoso&lt;/a&gt; allows you to say that two things, if they share a uniquely identifying property, are the same.  Examples of uniquely identifying properties would be a book&amp;#39;s ISBN, or a person&amp;#39;s social security number plus full name.  In relational language this is a &lt;i&gt;unique key&lt;/i&gt;, and in &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id145ed998&quot;&gt;RDF&lt;/a&gt; parlance, an &lt;i&gt;inverse functional property&lt;/i&gt;.&lt;/p&gt;

&lt;p&gt;In most systems, such problems are dealt with as a preprocessing step before querying.  For example, all the items that are considered the same will get the same properties or at load time all identifiers will be normalized according to some application rules.  This is good if the rules are clear and understood.  This is so in closed situations, where things tend to have standard identifiers to begin with.  But on the open web this is not so clear cut.&lt;/p&gt;

&lt;p&gt;In this post, we show how to do these things &lt;i&gt;ad hoc&lt;/i&gt;, without materializing anything.  At the end, we also show how to materialize identity and what the consequences of this are with open web &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id11726358&quot;&gt;data&lt;/a&gt;.  We use real live web crawls from the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id14f40448&quot;&gt;Billion Triples Challenge&lt;/a&gt; data set.&lt;/p&gt;

&lt;p&gt;On the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id156e2b10&quot;&gt;linked data&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id1106ce08&quot;&gt;web&lt;/a&gt;, there are independently arising descriptions of the same thing and thus arises the need to smoosh, if these are to be somehow integrated.  But this is only the beginning of the problems.&lt;/p&gt;

&lt;p&gt;To address these, we have added the option of specifying that some property will be considered inversely functional in a query.  This is done at run time and the property does not really have to be inversely functional in the pure sense.  &lt;code&gt;foaf:name&lt;/code&gt; will do for an example.  This simply means that for purposes of the query concerned, two subjects which have at least one &lt;code&gt;foaf:name&lt;/code&gt; in common are considered the same. In this way, we can join between FOAF files.  With the same database, a query about music preferences might consider having the same name as &amp;quot;same enough,&amp;quot; but a query about criminal prosecution would obviously need to be more precise about sameness.&lt;/p&gt;

&lt;p&gt;Our ontology is defined like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- Populate a named graph with the triples you want to use in query time inferencing&lt;br /&gt;
ttlp ( &amp;#39;
        @prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; .
        @prefix owl:  &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt; .
        foaf:mbox_sha1sum  a  owl:InverseFunctionalProperty  .
        foaf:name          a  owl:InverseFunctionalProperty  .
       &amp;#39;,
       &amp;#39;xx&amp;#39;,
       &amp;#39;b3sifp&amp;#39;
     );&lt;br /&gt;
-- Declare that the graph contains an ontology for use in query time inferencing &lt;br /&gt;
rdfs_rule_set ( &amp;#39;http://example.com/rules/b3sifp#&amp;#39;,
                &amp;#39;b3sifp&amp;#39;
              );
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then use it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   DEFINE input:inference &amp;quot;http://example.com/rules/b3sifp#&amp;quot; 
   SELECT DISTINCT ?k ?f1 ?f2 
   WHERE { ?k   foaf:name     ?n                   . 
           ?n   bif:contains  &amp;quot;&amp;#39;Kjetil Kjernsmo&amp;#39;&amp;quot;  . 
           ?k   foaf:knows    ?f1                  . 
           ?f1  foaf:knows    ?f2 
         };&lt;br /&gt;
VARCHAR                                  VARCHAR                                           VARCHAR
______________________________________   _______________________________________________   ______________________________&lt;br /&gt;
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/dajobe
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/net_twitter
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/amyvdh
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/pom
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/mattb
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/davorg
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/distobj
http://www.kjetil.kjernsmo.net/foaf#me   http://norman.walsh.name/knows/who/robin-berjon   http://twitter.com/perigrin
....
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Without the inference, we get no matches.  This is because the data in question has one graph per FOAF file, and blank nodes for persons.  No graph references any person outside the ones in the graph.  So if somebody is mentioned as known, then without the inference there is no way to get to what that person&amp;#39;s FOAF file says, since the same individual will be a different blank node there.  The declaration in the context named &lt;code&gt;b3sifp&lt;/code&gt; just means that all things with a matching &lt;code&gt;foaf:name&lt;/code&gt; or &lt;code&gt;foaf:mbox_sha1sum&lt;/code&gt; are the same.&lt;/p&gt;

&lt;p&gt;Sameness means that two are the same for purposes of &lt;code&gt;DISTINCT&lt;/code&gt; or &lt;code&gt;GROUP BY&lt;/code&gt;, and if two are the same, then both have the &lt;code&gt;UNION&lt;/code&gt; of all of the properties of both.&lt;/p&gt;

&lt;p&gt;If this were a naive smoosh, then the individuals would have all the same properties but would not be the same for &lt;code&gt;DISTINCT&lt;/code&gt;.&lt;/p&gt;
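
&lt;p&gt;One way to observe this, sketched here with the rule set declared above, is to count distinct subjects with and without the inference.  The name literal is only an example, and we make no claim about the counts returned on any particular data set.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- Without inference: every subject carrying the name counts separately 
sparql 
   PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
   SELECT COUNT (DISTINCT ?k) 
    WHERE { ?k  foaf:name  &amp;quot;Kjetil Kjernsmo&amp;quot; } ;

-- With inference: subjects sharing a foaf:name are the same for DISTINCT 
sparql 
   DEFINE input:inference &amp;quot;http://example.com/rules/b3sifp#&amp;quot; 
   PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
   SELECT COUNT (DISTINCT ?k) 
    WHERE { ?k  foaf:name  &amp;quot;Kjetil Kjernsmo&amp;quot; } ;
&lt;/pre&gt;&lt;/blockquote&gt;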

&lt;p&gt;If we have complex application rules for determining whether individuals are the same, then one can materialize &lt;code&gt;owl:sameAs&lt;/code&gt; triples and keep them in a separate graph.  In this way, the original data is not contaminated and the materialized volume stays reasonable, nothing like the blow-up of duplicating properties across instances.&lt;/p&gt;
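
&lt;p&gt;As a sketch of such a materialization, the statement below writes &lt;code&gt;owl:sameAs&lt;/code&gt; triples into a graph of their own for subjects that share a &lt;code&gt;foaf:mbox_sha1sum&lt;/code&gt;.  The graph name &lt;code&gt;sameas&lt;/code&gt; and the rule itself are illustrative assumptions; a real application rule would likely be more involved.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
   PREFIX owl:  &amp;lt;http://www.w3.org/2002/07/owl#&amp;gt;
   INSERT 
      INTO GRAPH &amp;lt;sameas&amp;gt; 
      { ?a  owl:sameAs  ?b } 
   WHERE { ?a  foaf:mbox_sha1sum  ?m  . 
           ?b  foaf:mbox_sha1sum  ?m  . 
           FILTER ( ?a != ?b ) 
         };
&lt;/pre&gt;&lt;/blockquote&gt;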

&lt;p&gt;The pro-smoosh argument is that if every duplicate makes exactly the same statements, then there is no great blow-up.  Best and worst cases will always depend on the data.  In rough terms, the more &lt;i&gt;ad hoc&lt;/i&gt; the use, the less desirable the materialization.  If the usage pattern is really set, then a relational-style application-specific representation with identity resolved at load time will perform best.  We can do that too, but so can others.&lt;/p&gt;

&lt;p&gt;The principal point is about agility as concerns the inference.  Run time is more agile than materialization, and if the rules change or if different users have different needs, then materialization runs into trouble.  When talking web scale, having multiple users is a given; it is very uneconomical to give everybody their own copy, and the likelihood of a user accessing any significant part of the corpus is minimal.  Even if the queries were not limited, the user would typically not wait for the answer of a query doing a scan or aggregation over 1 billion &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id1156a550&quot;&gt;blog&lt;/a&gt; posts or something of the sort.  So queries will typically be selective.  Selective means that they do not access all of the data, hence do not benefit from ready-made materialization for things they do not even look at. &lt;/p&gt;

&lt;p&gt;The exception is corpus-wide statistics queries.  But these will not be done in interactive time anyway, and will not be done very often. Plus, since these do not typically run all in memory, these are disk bound.  And when things are disk bound, size matters.  Reading extra entailment on the way is just a performance penalty.&lt;/p&gt;

&lt;p&gt;Enough talk. Time for an experiment.  We take the Yahoo and Falcon web crawls from the Billion Triples Challenge set, and do two things with the FOAF data in them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolve identity at insert time.  We remove duplicate person URIs, and give the single &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id11317008&quot;&gt;URI&lt;/a&gt; all the properties of all the duplicate URIs.  We expect these to be most often repeats.  If a person references another person, we normalize this reference to go to the single URI of the referenced person.&lt;/li&gt;

&lt;li&gt;Give every duplicate URI of a person all the properties of all the duplicates.  If these are the same value, the data should not get much bigger, or so we think.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the experiment, we will consider two people the same if they have the same &lt;code&gt;foaf:name&lt;/code&gt; and are both instances of &lt;code&gt;foaf:Person&lt;/code&gt;.  This yields some extra matches, but these should not be statistically significant.&lt;/p&gt;

&lt;p&gt;The following is a commented &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id110945b0&quot;&gt;SQL&lt;/a&gt; script performing the smoosh.  We play with internal IDs of things, thus some of these operations cannot be done in SPARQL alone.  We use SPARQL where possible for readability.  As the documentation states, &lt;code&gt;iri_to_id&lt;/code&gt; converts from the qualified name of an IRI to its ID and &lt;code&gt;id_to_iri&lt;/code&gt; does the reverse.&lt;/p&gt;
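
&lt;p&gt;A quick round trip, as a sketch: the inner call turns the IRI into its internal ID and the outer call turns that ID back into the IRI string.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;SELECT ID_TO_IRI ( IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;) ) ;
&lt;/pre&gt;&lt;/blockquote&gt;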

&lt;p&gt;We count the triples that enter into the smoosh:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- the name is tested with an existence check, because otherwise we would get 
-- several times more rows due to the names occurring in many graphs &lt;br /&gt;
sparql 
   SELECT COUNT(*) 
    WHERE { { SELECT DISTINCT ?person 
               WHERE { ?person a foaf:Person }
            } . 
            FILTER ( bif:exists ( SELECT (1) 
                                   WHERE { ?person foaf:name ?nn } 
                                )
                       ) . 
            ?person ?p ?o
          };&lt;br /&gt;
-- We get 3284674
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We make a few tables for intermediate results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- For each distinct name, gather the properties and objects from 
-- all subjects with this name &lt;br /&gt;
CREATE TABLE name_prop 
   ( np_name  ANY, 
     np_p     IRI_ID_8, 
     np_o     ANY, 
     PRIMARY KEY ( np_name, 
                   np_p, 
                   np_o
                 )
   );
ALTER INDEX name_prop 
   ON name_prop 
   PARTITION ( np_name VARCHAR (-1, 0hexffff) );&lt;br /&gt;
-- Map from name to canonical IRI used for the name &lt;br /&gt;
CREATE TABLE name_iri ( ni_name  ANY PRIMARY KEY, 
                        ni_s     IRI_ID_8
                      );
ALTER INDEX name_iri 
   ON name_iri 
   PARTITION ( ni_name VARCHAR (-1, 0hexffff) );&lt;br /&gt;
-- Map from person IRI to canonical person IRI&lt;br /&gt;
CREATE TABLE pref_iri 
   ( i     IRI_ID_8, 
     pref  IRI_ID_8, 
     PRIMARY KEY ( i )
   );
ALTER INDEX pref_iri 
   ON pref_iri 
   PARTITION ( i INT (0hexffff00) );&lt;br /&gt;
-- a table for the materialization where all aliases get all properties of every other &lt;br /&gt;
CREATE TABLE smoosh_ct 
   ( s  IRI_ID_8, 
     p  IRI_ID_8, 
     o  ANY, 
     PRIMARY KEY ( s, 
                   p, 
                   o
                 ) 
   );
ALTER INDEX smoosh_ct 
   ON smoosh_ct 
   PARTITION ( s INT (0hexffff00) );&lt;br /&gt;
-- disable transaction log and enable row auto-commit.  This is necessary, otherwise 
-- bulk operations are done transactionally and they will run out of rollback space.&lt;br /&gt;
LOG_ENABLE (2);&lt;br /&gt;
-- Gather all the properties of all persons with a name under that name.  
-- INSERT SOFT means that duplicates are ignored &lt;br /&gt;
INSERT SOFT name_prop 
   SELECT &amp;quot;n&amp;quot;, &amp;quot;p&amp;quot;, &amp;quot;o&amp;quot; 
   FROM ( sparql 
          DEFINE output:valmode &amp;quot;LONG&amp;quot; 
          SELECT ?n ?p ?o 
          WHERE { ?x a foaf:Person . 
                 ?x foaf:name ?n . 
                 ?x ?p ?o
               }
        ) xx ;&lt;br /&gt;
-- Now choose for each name the canonical IRI &lt;br /&gt;
INSERT INTO name_iri 
   SELECT np_name, 
          ( SELECT MIN (s) 
              FROM rdf_quad 
             WHERE o = np_name 
                   AND p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;)
          ) AS mini 
     FROM name_prop 
    WHERE np_p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;) ;&lt;br /&gt;
-- For each person IRI, map to the canonical IRI of that person &lt;br /&gt;
INSERT SOFT pref_iri (i, pref) 
   SELECT s, 
          ni_s 
     FROM name_iri, 
          rdf_quad 
    WHERE o = ni_name 
          AND p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;) ;&lt;br /&gt;
-- Make a graph where all persons have one iri with all the properties of all aliases 
-- and where person-to-person refs are canonicalized&lt;br /&gt;
INSERT SOFT rdf_quad (g,s,p,o) 
   SELECT IRI_TO_ID (&amp;#39;psmoosh&amp;#39;), 
          ni_s, 
          np_p, 
          COALESCE ( ( SELECT pref 
                         FROM pref_iri 
                        WHERE i = np_o
                     ), 
                     np_o 
                   )
     FROM name_prop, 
          name_iri 
    WHERE ni_name = np_name 
   OPTION ( loop, quietcast ) ;&lt;br /&gt;
-- A little explanation:  The properties of names are copied into rdf_quad with the name 
-- replaced with its canonical IRI.  If the object has a canonical IRI, this is used as 
-- the object, else the object is unmodified.  This is the COALESCE with the sub-query.&lt;br /&gt;
-- This takes a little time.  To check on the progress, take another connection to the 
-- server and do &lt;br /&gt;
STATUS (&amp;#39;cluster&amp;#39;);&lt;br /&gt;
-- It will return something like 
-- Cluster 4 nodes, 35 s. 108 m/s 1001 KB/s  75% cpu 186%  read 12% clw threads 5r 0w 0i 
-- buffers 549481 253929 d 8 w 0 pfs&lt;br /&gt;
-- Now finalize the state; this makes it permanent.  Else the work will be lost on server 
-- failure, since there was no transaction log &lt;br /&gt;
CL_EXEC (&amp;#39;checkpoint&amp;#39;);&lt;br /&gt;
-- See what we got&lt;br /&gt;
sparql 
   SELECT COUNT (*) 
     FROM &amp;lt;psmoosh&amp;gt; 
     WHERE {?s ?p ?o};&lt;br /&gt;
-- This is 2253102&lt;br /&gt;
-- Now make the copy where all have the properties of all synonyms.  This takes so much 
-- space we do not insert it as RDF quads, but make a special table for it so that we can 
-- run some statistics.  This saves time.&lt;br /&gt;
INSERT SOFT smoosh_ct (s, p, o)  
   SELECT s, np_p, np_o 
     FROM name_prop, 
          rdf_quad 
    WHERE o = np_name 
          AND p = IRI_TO_ID (&amp;#39;http://xmlns.com/foaf/0.1/name&amp;#39;) ;&lt;br /&gt;
-- as above, INSERT SOFT so as to ignore duplicates &lt;br /&gt;
SELECT COUNT (*) 
   FROM smoosh_ct;&lt;br /&gt;
-- This is  167360324&lt;br /&gt;
-- Find out where the bloat comes from &lt;br /&gt;
SELECT TOP 20 COUNT (*), 
              ID_TO_IRI (p) 
   FROM smoosh_ct 
   GROUP BY p 
   ORDER BY 1 DESC;
&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;The results are:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;54728777          http://www.w3.org/2002/07/owl#sameAs
48543153          http://xmlns.com/foaf/0.1/knows
13930234          http://www.w3.org/2000/01/rdf-schema#seeAlso
12268512          http://xmlns.com/foaf/0.1/interest
11415867          http://xmlns.com/foaf/0.1/nick
6683963           http://xmlns.com/foaf/0.1/weblog
6650093           http://xmlns.com/foaf/0.1/depiction
4231946           http://xmlns.com/foaf/0.1/mbox_sha1sum
4129629           http://xmlns.com/foaf/0.1/homepage
1776555           http://xmlns.com/foaf/0.1/holdsAccount
1219525           http://xmlns.com/foaf/0.1/based_near
305522            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
274965            http://xmlns.com/foaf/0.1/name
155131            http://xmlns.com/foaf/0.1/dateOfBirth
153001            http://xmlns.com/foaf/0.1/img
111130            http://www.w3.org/2001/vcard-rdf/3.0#ADR
52930             http://xmlns.com/foaf/0.1/gender
48517             http://www.w3.org/2004/02/skos/core#subject
45697             http://www.w3.org/2000/01/rdf-schema#label
44860             http://purl.org/vocab/bio/0.1/olb
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Now compare with the predicate distribution of the smoosh with identities canonicalized:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
     SELECT COUNT (*) ?p 
       FROM &amp;lt;psmoosh&amp;gt; 
      WHERE { ?s ?p ?o } 
   GROUP BY ?p 
   ORDER BY 1 DESC 
      LIMIT 20;&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Results are:&lt;/p&gt;
&lt;blockquote&gt;
&lt;pre&gt;748311            http://xmlns.com/foaf/0.1/knows
548391            http://xmlns.com/foaf/0.1/interest
140531            http://www.w3.org/2000/01/rdf-schema#seeAlso
105273            http://www.w3.org/1999/02/22-rdf-syntax-ns#type
78497             http://xmlns.com/foaf/0.1/name
48099             http://www.w3.org/2004/02/skos/core#subject
45179             http://xmlns.com/foaf/0.1/depiction
40229             http://www.w3.org/2000/01/rdf-schema#comment
38272             http://www.w3.org/2000/01/rdf-schema#label
37378             http://xmlns.com/foaf/0.1/nick
37186             http://dbpedia.org/property/abstract
34003             http://xmlns.com/foaf/0.1/img
26182             http://xmlns.com/foaf/0.1/homepage
23795             http://www.w3.org/2002/07/owl#sameAs
17651             http://xmlns.com/foaf/0.1/mbox_sha1sum
17430             http://xmlns.com/foaf/0.1/dateOfBirth
15586             http://xmlns.com/foaf/0.1/page
12869             http://dbpedia.org/property/reference
12497             http://xmlns.com/foaf/0.1/weblog
12329             http://blogs.yandex.ru/schema/foaf/school
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We can drop the &lt;code&gt;owl:sameAs&lt;/code&gt; triples from the count, which reduces the bloat somewhat, but the result is still tens of times larger than the canonicalized copy or the initial state.&lt;/p&gt;
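
&lt;p&gt;Using the counts reported above, and leaving out the &lt;code&gt;owl:sameAs&lt;/code&gt; triples on both sides, the ratio between the two copies can be worked out directly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- (all-aliases copy minus its owl:sameAs) / (canonicalized copy minus its owl:sameAs) 
select (167360324 - 54728777) / (2253102 - 23795.0);
-- roughly 50.5
&lt;/pre&gt;&lt;/blockquote&gt;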

&lt;p&gt;Now, when we try using the psmoosh graph, we still get different results from the results with the original data.  This is because &lt;code&gt;foaf:knows&lt;/code&gt; relations to things with no &lt;code&gt;foaf:name&lt;/code&gt; are not represented in the smoosh.  These exist:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
SELECT COUNT (*) 
   WHERE { ?s foaf:knows ?thing . 
           FILTER ( !bif:exists ( SELECT (1) 
                                   WHERE { ?thing foaf:name ?nn }
                                )
                  ) 
         };&lt;br /&gt;
-- 1393940
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;So the smoosh graph is not an accurate rendition of the social network.  It would have to be smooshed further to be that, since the data in the sample is quite irregular.  But we do not go that far here.&lt;/p&gt;

&lt;p&gt;Finally, we calculate the smoosh blow up factors.  We do not include &lt;code&gt;owl:sameAs&lt;/code&gt; triples in the counts.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;select (167360324 - 54728777) / 3284674.0;
34.290022997716059&lt;br /&gt;
select 2229307 / 3284674.0;
0.678699621332284
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;So, to get a smoosh that is not really the equivalent of the original, multiply the original triple count by 0.68 if synonyms are collapsed, or by 34 if they are not.&lt;/p&gt;

&lt;p&gt;Making the smooshes does not take very long, some minutes for the small one.  Inserting the big one would take longer, a couple of hours maybe.  It was 33 minutes for filling the &lt;code&gt;smoosh_ct&lt;/code&gt; table.  The measurements were not made with optimal tuning, so the performance numbers just serve to show that smooshing takes time.  Probably more time than allowable in an interactive situation, no matter how the process is optimized.&lt;/p&gt;</description></item><item><title>Virtuoso Vs. MySQL:  Setting the Berlin Record Straight (update 2)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1484#comments</comments><pubDate>Thu, 20 Nov 2008 11:06:11 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-11-24T10:15:05-05:00</n0:modified><description>&lt;p&gt;In the context of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0xa5314d8&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt;, I have repeatedly written about measurement procedures and steady state.  The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup.  Thus it came to pass that one cut of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x18482c20&quot;&gt;BSBM&lt;/a&gt; paper had 3 seconds for &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0xb8c54de8&quot;&gt;MySQL&lt;/a&gt; and 100 for &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x189b2210&quot;&gt;Virtuoso&lt;/a&gt;, basically through ignoring cache effects.&lt;/p&gt;

&lt;p&gt;So we decided to do it ourselves.&lt;/p&gt;

&lt;p&gt;The score is (updated with revised &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; setting, based on advice noted below):&lt;/p&gt;

&lt;table border=&quot;1&quot; cellspacing=&quot;2&quot; cellpadding=&quot;5&quot;&gt;
&lt;tr&gt;
    &lt;th&gt;n-clients&lt;/th&gt;
    &lt;th&gt;Virtuoso&lt;/th&gt;
    &lt;th&gt;MySQL &lt;br /&gt; (with increased buffer pool size)&lt;/th&gt;
    &lt;th&gt;MySQL &lt;br /&gt; (with default buffer pool size)&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr align=&quot;right&quot;&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt; 41,161.33&lt;/td&gt;
    &lt;td&gt; 27,023.11 &lt;/td&gt;
    &lt;td&gt; 12,171.41&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr align=&quot;right&quot;&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt; 127,918.30&lt;/td&gt;
    &lt;td&gt; (pending) &lt;/td&gt;
    &lt;td&gt;  37,566.82&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr align=&quot;right&quot;&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt; 218,162.29 &lt;/td&gt;
    &lt;td&gt; 105,524.23 &lt;/td&gt;
    &lt;td&gt;  51,104.39 &lt;/td&gt;
  &lt;/tr&gt;
&lt;tr align=&quot;right&quot;&gt;
    &lt;td&gt;16&lt;/td&gt;
    &lt;td&gt; 214,763.58 &lt;/td&gt;
    &lt;td&gt;  98,852.42 &lt;/td&gt;
    &lt;td&gt;  47,589.18 &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;The metric is the query mixes per hour from the BSBM test driver output.  For the interested, the complete output is &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/bsbmres.txt&quot; id=&quot;link-id1119f770&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The benchmark is pure &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x5257718&quot;&gt;SQL&lt;/a&gt;, nothing to do with &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xb8c463e0&quot;&gt;SPARQL&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x16e68d50&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 GB RAM.  The OS is 64-bit Debian Linux.&lt;/p&gt;

&lt;p&gt;The benchmark was run at a scale of 200,000.  Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like.  Both databases were configured to use 8 GB for disk cache.  The test effectively runs from memory.  We ran &lt;code&gt;ANALYZE TABLE&lt;/code&gt; on each MySQL table but noticed that this had no effect.  Virtuoso does the stats sampling on the go; possibly MySQL does also, since the explicit stats did not make any difference.  The MySQL tables were served by the InnoDB engine.  MySQL appears to cache results of queries in some cases.  This was not apparent in the tests.&lt;/p&gt;
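
&lt;p&gt;For readers who want to repeat these two checks on the MySQL side, a minimal sketch follows.  The table name &lt;code&gt;offer&lt;/code&gt; is only an illustration; substitute the tables from the table definitions linked below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;-- MySQL: refresh the optimizer statistics of one table (table name is illustrative) 
ANALYZE TABLE offer;

-- MySQL: show the configured InnoDB buffer pool size, in bytes 
SHOW VARIABLES LIKE &amp;#39;innodb_buffer_pool_size&amp;#39;;
&lt;/pre&gt;&lt;/blockquote&gt;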

&lt;p&gt;The versions are 5.09 for Virtuoso and 5.1.29 for MySQL.  You can download and examine:&lt;/p&gt;
&lt;ul&gt; 
&lt;li&gt;
&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/virtuoso.ini&quot; id=&quot;link-id14fe17f0&quot;&gt;Virtuoso configuration file&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/my.cnf&quot; id=&quot;link-id116fe490&quot;&gt;MySQL configuration file&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/create_tables_and_rdf_view.sql&quot; id=&quot;link-id14ce9268&quot;&gt;Table definitions &amp;amp; RDF views&lt;/a&gt; 
&lt;/li&gt;
&lt;li&gt; &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/mysqlinx.sql&quot; id=&quot;link-id1535e298&quot;&gt;Indexes on MySQL tables&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;strike&gt;MySQL ought to do better.  We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL.  It is possible there is a config parameter that affects this.  The public is invited to tell us about such.&lt;/strike&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;Update:&lt;/b&gt;
&lt;/p&gt;

&lt;p&gt;Andreas Schultz of the BSBM team advised us to increase the &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; setting in the MySQL config.  We did and it produced some improvement.  Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better.  We ran these things a few times so as to have warm cache.&lt;/p&gt;

&lt;p&gt;On the first run, we noticed that the InnoDB warm-up time was well in excess of 2000 query mixes.  Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso.  We recently made a greedy prefetch hack that should give us some mileage there.  For the next BSBM, all we can advise is to run a larger-scale system for half an hour first, then measure, and then measure again.  If the second measurement is the same as the first, then it is good.&lt;/p&gt;
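
&lt;p&gt;The measure-then-measure-again advice can be sketched as a small check. This is only an illustration in Python, not part of the BSBM test driver: the function name &lt;code&gt;is_steady_state&lt;/code&gt;, the 5% tolerance, and the sample readings are all assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import math

def is_steady_state(first_qmph, second_qmph, rel_tol=0.05):
    '''Return True when two back-to-back throughput readings agree.

    rel_tol is an assumed tolerance (5 percent); the advice itself only
    says that the second measurement should be the same as the first.
    '''
    return math.isclose(first_qmph, second_qmph, rel_tol=rel_tol)

# Hypothetical query-mixes-per-hour readings, not measured results.
print(is_steady_state(100000.0, 101500.0))  # readings agree within 5 percent
print(is_steady_state(60000.0, 100000.0))   # cache was still warming up
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The tolerance is a judgment call; the advice only asks that the two readings agree.&lt;/p&gt;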

&lt;p&gt;As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x122eaa00&quot;&gt;TPC-H&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>ISWC 2008: Some Questions</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1479#comments</comments><pubDate>Tue, 04 Nov 2008 15:54:42 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-11-04T14:36:50.000010-05:00</n0:modified><description>&lt;h2&gt;Inference: Is it always forward chaining?&lt;/h2&gt;

&lt;p&gt;We got a number of questions about &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x131604a8&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.&lt;/p&gt;

&lt;p&gt;Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1422f910&quot;&gt;SPARQL&lt;/a&gt;, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x145894a8&quot;&gt;owl&lt;/a&gt;:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.&lt;/p&gt;

&lt;p&gt;For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1458a288&quot;&gt;SQL&lt;/a&gt;/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this remains to be seen. Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.&lt;/p&gt;

&lt;p&gt;We had a word with &lt;a href=&quot;http://web.comlab.ox.ac.uk/people/Ian.Horrocks/&quot; id=&quot;link-id117c99d0&quot;&gt;Ian Horrocks&lt;/a&gt; on the question. He noted that the community is often naive in equating a description of semantics with a description of an algorithm. The &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x14cf0b18&quot;&gt;data&lt;/a&gt; need not always be blown up.&lt;/p&gt;

&lt;p&gt;The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if the entailment of subclasses and sub-properties is materialized rather than done at run time.&lt;/p&gt;
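
&lt;p&gt;The trade-off can be made concrete with a toy example. The Python sketch below is not how Virtuoso implements inference; the class names, the instances, and the single-parent hierarchy are assumptions made for illustration. It answers the same question once by resolving subclass entailment at query time and once from a materialized copy of the data.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Toy data; all class and instance names are hypothetical.
SUBCLASS_OF = {'GraduateStudent': 'Student', 'Student': 'Person'}
TYPE_TRIPLES = {('alice', 'GraduateStudent'), ('bob', 'Student'), ('carol', 'Person')}

def superclasses(cls):
    '''Yield cls and every class above it in the hierarchy.'''
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)

def instances_at_run_time(cls):
    '''Backward chaining: resolve subclass entailment while answering.'''
    return {s for (s, c) in TYPE_TRIPLES if cls in superclasses(c)}

def materialize():
    '''Forward chaining: store every entailed type triple up front.'''
    return {(s, sup) for (s, c) in TYPE_TRIPLES for sup in superclasses(c)}

materialized = materialize()
instances_materialized = {s for (s, c) in materialized if c == 'Person'}

print(sorted(instances_at_run_time('Person')))  # ['alice', 'bob', 'carol']
print(sorted(instances_materialized))           # same answer
print(len(TYPE_TRIPLES), len(materialized))     # 3 stored triples versus 6
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Both routes give the same answer. The materialized set is twice the size here, which is the working-set cost, and every entailed triple would have to be found and removed again if a source statement were retracted.&lt;/p&gt;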

&lt;h2&gt;Genetic Algorithms for SPARQL?&lt;/h2&gt;

&lt;p&gt;Our compliments for the wildest idea of the conference go to &lt;a href=&quot;http://www.eyaloren.org/&quot; id=&quot;link-id1a203af8&quot;&gt;Eyal Oren&lt;/a&gt;, &lt;a href=&quot;http://www.few.vu.nl/~cgueret/&quot; id=&quot;link-id16208758&quot;&gt;Christophe Guéret&lt;/a&gt;, and &lt;a href=&quot;http://www.few.vu.nl/~schlobac/&quot; id=&quot;link-id111923e0&quot;&gt;Stefan Schlobach&lt;/a&gt;, &lt;i&gt;et al.&lt;/i&gt;, for their &lt;a href=&quot;http://www.informatik.uni-trier.de/~ley/db/conf/semweb/iswc2008.html#OrenGS08&quot; id=&quot;link-id11793540&quot;&gt;paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated&lt;/a&gt;. Prisoners of our &amp;quot;conventional wisdom&amp;quot; as we are, this might never have occurred to us.&lt;/p&gt;

&lt;h2&gt;Schema Last?&lt;/h2&gt;

&lt;p&gt;It is interesting to see how the industry comes to the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1154c1b0&quot;&gt;semantic web&lt;/a&gt; conferences talking about schema last, while at the same time the traditional semantic web people stress enforcing schema constraints and making logics that perform more predictably and are friendlier to databases. So the extremes converge.&lt;/p&gt;

&lt;p&gt;There is a point to schema last. &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x14c6a930&quot;&gt;RDF&lt;/a&gt; is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware.&lt;/p&gt;

&lt;p&gt;It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading&lt;/p&gt;</description></item><item><title>ISWC 2008: Billion Triples Challenge</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1478</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1478#comments</comments><pubDate>Tue, 04 Nov 2008 15:52:11 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-11-04T13:52:00-05:00</n0:modified><description>&lt;p&gt;We showed our billion triples demo at the &lt;a href=&quot;http://iswc2008.semanticweb.org/&quot; id=&quot;link-id0x13a0a520&quot;&gt;ISWC 2008&lt;/a&gt; poster session. Generally people liked what they saw, as we basically did what one always had wanted to do with &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x138f5798&quot;&gt;SPARQL&lt;/a&gt; but never could. This means firstly full &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1264a688&quot;&gt;SQL&lt;/a&gt; parity, with sub-queries, aggregation, full text, etc. Beyond SQL, we have transitive sub-queries, &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x1e084138&quot;&gt;owl&lt;/a&gt;:sameAs at run time, and other inference things, all on demand.&lt;/p&gt;

&lt;p&gt;The live demo is at &lt;a href=&quot;http://b3s.openlinksw.com/&quot; id=&quot;link-id14ba36e0&quot;&gt;http://b3s.openlinksw.com/&lt;/a&gt;. This site is under development and may not be on all the time. We are taking it in the direction of hosting the whole &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x13355f80&quot;&gt;LOD&lt;/a&gt; cloud. This is an evolving operation where we will continue showcasing how one can ask increasingly interesting questions from a growing online database, in the spirit of the billion triples charter.&lt;/p&gt;

&lt;p&gt;In the words of &lt;a href=&quot;http://www.cs.rpi.edu/~hendler/&quot; id=&quot;link-id111ad740&quot;&gt;Jim Hendler&lt;/a&gt;, we were not selected for the finale because this would have made the challenge a database shootout instead of a more research-oriented event. There is some point to this since if the event becomes like the TPC benchmarks, this will limit the entrance to full time database players. Anyway, we got a special mention in the intro of the challenge track.&lt;/p&gt;

&lt;p&gt;The winner was Semaplorer, a federated SPARQL query system. There is some merit to this, as we ourselves are not convinced that centralization is always the right direction. As discussed in the &lt;i&gt;&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1376&quot; id=&quot;link-id1831cce0&quot;&gt;DARQ Matter of Federation&lt;/a&gt;&lt;/i&gt; post, we have a notion of how to do this production-strength with our cluster engine, now also over wide area networks. We shall see.&lt;/p&gt;

&lt;h2&gt;Why Not Just Join?&lt;/h2&gt;

&lt;p&gt;The entries from DERI and LarKC (&lt;a href=&quot;http://www.larkc.eu/marvin/&quot; id=&quot;link-id1bb42778&quot;&gt;MaRVIN&lt;/a&gt;, &amp;quot;Massive &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id19c15d30&quot;&gt;RDF&lt;/a&gt; Versatile Inference Network&amp;quot;) were doing materialization of inference results in a cluster environment. The thing they were not doing was joining across partitions. Thus, the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1d3c1a38&quot;&gt;data&lt;/a&gt; was partitioned on some criterion and then the data in each partition was further refined according to rules known to all partitions. DERI did not address joining further.&lt;/p&gt;

&lt;p&gt;&amp;quot;Nature shall be the guide of the alchemist,&amp;quot; goes the old adage. We can look at MaRVIN as an example of this dictum. Networks of people are low bandwidth, not nearly fully connected. Asking a colleague for &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x125dd698&quot;&gt;information&lt;/a&gt; is expensive and subject to misunderstanding; asking another research group might never produce an answer.&lt;/p&gt;

&lt;p&gt;Even looking at one individual, we have no reason to think that the human expert would do complete reasoning. Indeed, the brain is a sort of compute cluster, but it does not have flat-latency point-to-point connectivity: some joins are fast; others are not even tried, for all we know.&lt;/p&gt;

&lt;p&gt;A database running on a cluster is a sort of counter-example. A database with RDF workload will end up joining across partitions pretty much all of the time.&lt;/p&gt;

&lt;p&gt;MaRVIN&amp;#39;s approach to joining could be likened to a country dance: Boys get to take a whirl with different girls according to a complex pattern. For match-making, some matches are produced early but one never knows if the love of a lifetime might be just around the corner. Also, if the dancers are inexperienced, they will have little ability to evaluate how good a match they have with their partner. A few times around the dance floor are needed to get the hang of things.&lt;/p&gt;

&lt;p&gt;The question is, at what point will it no longer be possible to join across the database? This depends on the interconnect latency. The higher the latency, the more useful the square-dancing approach becomes.&lt;/p&gt;

&lt;p&gt;Another practical consideration is the fact that RDF reasoners are not usually built for distributed memory multiprocessors. If the reasoner must be a plug-in component, then it cannot be expected to be written for grids.&lt;/p&gt;

&lt;p&gt;We can think of a product safety use case: Find cosmetics that have ingredients that are considered toxic in the amounts they are present in each product. This can be done as a database query with some transitive operations, like running through a cosmetics taxonomy and a poisons database. If the business logic deciding whether the presence of an ingredient in the product is a health hazard is very complex, we can get a lot of joins.&lt;/p&gt;
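
&lt;p&gt;As a rough illustration of the database-query approach, the Python sketch below walks a small taxonomy transitively and joins products against a table of limits. Every product, ingredient, and number is invented, and the single comparison stands in for what could be much more complex business logic.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# All names and numbers are invented for illustration.
BROADER = {'lipstick': 'lip product', 'lip product': 'cosmetic',
           'eyeliner': 'eye product', 'eye product': 'cosmetic'}   # taxonomy
PRODUCT_CLASS = {'RedGlow': 'lipstick', 'NightLine': 'eyeliner', 'DeskLamp': 'lamp'}
CONTAINS = {('RedGlow', 'ingredient_a'): 0.002,    # fraction of the product
            ('NightLine', 'ingredient_b'): 0.05,
            ('DeskLamp', 'ingredient_a'): 0.01}
SAFE_LIMIT = {'ingredient_a': 0.001, 'ingredient_b': 0.10}         # poisons table

def is_a(cls, ancestor):
    '''Transitive step: walk the taxonomy upward from cls.'''
    while cls is not None:
        if cls == ancestor:
            return True
        cls = BROADER.get(cls)
    return False

def unsafe_cosmetics():
    '''Join products, ingredients and limits; keep cosmetics over the limit.'''
    hits = []
    for (product, ingredient), amount in sorted(CONTAINS.items()):
        limit = SAFE_LIMIT.get(ingredient)
        if limit is None:
            continue  # ingredient is not in the poisons table
        if is_a(PRODUCT_CLASS[product], 'cosmetic') and amount > limit:
            hits.append((product, ingredient))
    return hits

print(unsafe_cosmetics())  # [('RedGlow', 'ingredient_a')]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The lamp is skipped because it is not a cosmetic, and the eyeliner because its ingredient stays under the limit. In a partitioned store, each of these lookups is a potential cross-partition join.&lt;/p&gt;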

&lt;p&gt;The MaRVIN way would be to set up a ball where each lipstick and eyeliner dances with every poison and then see if matches are made. The matching logic could be arbitrarily complex since it would run locally. Of course here, some domain &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x133b84b8&quot;&gt;knowledge&lt;/a&gt; is needed in order to set up the processing so that each product and poison carry all the associated information with them. Dancing with half a partner can bias one&amp;#39;s perceptions: Again, it is like nature, sometimes not all cards are on the table.&lt;/p&gt;

&lt;p&gt;It would seem that there is some setup involved before answering a question: Composition of partitions, frequency of result exchange, etc. How critical the domain knowledge implicit in the setup is for the quality of results is an interesting question.&lt;/p&gt;

&lt;p&gt;The question is, at what point will a cluster using &lt;a href=&quot;http://dbpedia.org/resource/federated_database_system&quot; id=&quot;link-id0x1466c1c0&quot;&gt;distributed database&lt;/a&gt; operations for inference become impractical? Of course, it is impractical from the get-go if the reasoners and query processors are not made for this. But what if they are? We are presently evaluating different message patterns for joining between partitions. The baseline is some 250,000 random single-triple lookups per second per core. Using a cluster increases this throughput. The increase is more or less linear depending on whether all intermediate results pass via one coordinating node (worst case) or whether each node can decide which other node will do the next join step for each result (best case). For example, a &lt;code&gt;DISTINCT&lt;/code&gt; operation requires that data passes through a single place but &lt;code&gt;JOIN&lt;/code&gt;ing and aggregation in general do not.&lt;/p&gt;
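
&lt;p&gt;The difference between the two message patterns can be illustrated with a deliberately simple count of the messages the busiest node has to handle. This is a toy model in Python under the assumption that results hash evenly across nodes; it is not a description of the Virtuoso cluster engine, and the numbers in the example are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def busiest_node_load(results_per_step, join_steps, nodes, via_coordinator):
    '''Messages handled by the most loaded node in a toy join pipeline.'''
    total = results_per_step * join_steps
    if via_coordinator:
        # Worst case: every intermediate result visits the coordinator.
        return total
    # Best case: each node forwards results straight to the next node,
    # so the load spreads out (ceiling division, assuming no skew).
    return -(-total // nodes)

# Hypothetical workload: 1000 intermediate results per step, 3 join steps.
for nodes in (1, 2, 4, 8):
    print(nodes,
          busiest_node_load(1000, 3, nodes, via_coordinator=True),
          busiest_node_load(1000, 3, nodes, via_coordinator=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the coordinator pattern the busiest node sees the same load however many nodes are added, while with direct forwarding its share falls as the cluster grows.&lt;/p&gt;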

&lt;p&gt;We will still publish numbers during this November.&lt;/p&gt;</description></item><item><title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1465#comments</comments><pubDate>Sun, 26 Oct 2008 12:15:35 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-27T12:07:52-04:00</n0:modified><description>&lt;p&gt;&amp;quot;Physician, heal thyself,&amp;quot; it is said. We profess to say what the messaging of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1fa3da18&quot;&gt;semantic web&lt;/a&gt; ought to be, but is our own perfect?&lt;/p&gt;

&lt;p&gt;I will here engage in some critical introspection as well as amplify on some answers given to &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1e1eecf0&quot;&gt;Virtuoso&lt;/a&gt;-related questions in recent times.&lt;/p&gt;

&lt;p&gt;I use some conversations from the &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x1ec0b2e0&quot;&gt;Vienna&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x2045ac10&quot;&gt;Linked Data&lt;/a&gt; Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x2045ac38&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x14f63c58&quot;&gt;OpenLink Data Spaces&lt;/a&gt;) applications line, &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x14f63c80&quot;&gt;OAT&lt;/a&gt; (&lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1e536928&quot;&gt;OpenLink Ajax Toolkit&lt;/a&gt;), or &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1eaed7f8&quot;&gt;ODE&lt;/a&gt; (&lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1edfff88&quot;&gt;OpenLink Data Explorer&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;&amp;quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&amp;quot; said &lt;a href=&quot;http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me&quot; id=&quot;link-id0x1b8a9580&quot;&gt;Sören Auer&lt;/a&gt;.&lt;/h3&gt;

&lt;p&gt;Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.&lt;/p&gt;

&lt;p&gt;This is why we put a lot of emphasis on Linked Data and the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1b85fa38&quot;&gt;Data&lt;/a&gt; Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1f5b4468&quot;&gt;RDF&lt;/a&gt; store.&lt;/p&gt;

&lt;p&gt;We can do this because we own our database and &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x20076468&quot;&gt;SQL&lt;/a&gt; and data access middleware and have a history of connecting to any &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1ffd6f98&quot;&gt;RDBMS&lt;/a&gt; out there.&lt;/p&gt;

&lt;p&gt;The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.&lt;/p&gt;

&lt;p&gt;There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1316f720&quot;&gt;SPARQL&lt;/a&gt; options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x20756a68&quot;&gt;Sesame&lt;/a&gt;- and &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x1ec01ac0&quot;&gt;Jena&lt;/a&gt;-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).&lt;/p&gt;

&lt;p&gt;Now, this message could be better reflected in our material on the web. This &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x2027b410&quot;&gt;blog&lt;/a&gt; is a rather informal step in this direction; more is to come. For now we concentrate on delivering.&lt;/p&gt;

&lt;p&gt;The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.&lt;/p&gt;

&lt;p&gt;This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story, for example, turning the whole world into a database. This is why our examples start with things like turning the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x2051ff98&quot;&gt;TPC-H&lt;/a&gt; database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?&lt;/p&gt;

&lt;p&gt;The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?&lt;/p&gt;

&lt;h3&gt;&amp;quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&amp;quot;&lt;/h3&gt;

&lt;p&gt;We should answer in multiple parts.&lt;/p&gt;

&lt;p&gt;For general collateral, like web sites and documentation:&lt;/p&gt;

&lt;p&gt;The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into&lt;/p&gt;

&lt;ul&gt;  
&lt;li&gt; Data web and RDF - Host linked data, expose relational assets as linked data;&lt;/li&gt;
&lt;li&gt; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;&lt;/li&gt;
&lt;li&gt; Web Services - access all the above over standard protocols, dynamic web pages, web hosting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each point, one simple statement.  We all know what the above things mean?&lt;/p&gt;

&lt;p&gt;Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or in some cases this many more requests per second. This too is clear.&lt;/p&gt;

&lt;p&gt;As far as I am concerned, hosting Java or .&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x1f297540&quot;&gt;NET&lt;/a&gt; does not have to be on the front page. Also, we have no great interest in going against &lt;a href=&quot;http://dbpedia.org/resource/Apache&quot; id=&quot;link-id0x1ea29578&quot;&gt;Apache&lt;/a&gt; when it comes to a web-server-only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.&lt;/p&gt;

&lt;p&gt;Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.&lt;/p&gt;

&lt;p&gt;Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.&lt;/p&gt;

&lt;p&gt;I think that the communications failure in this case is that we want to talk about end-to-end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not write a paper about the query cost model alone: the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to the cost model, as these are important to the whole, but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.&lt;/p&gt;

&lt;h3&gt;&amp;quot;Virtuoso is very big and very difficult&amp;quot;&lt;/h3&gt;

&lt;p&gt;One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.&lt;/p&gt;

&lt;p&gt;This gives you SQL and SPARQL out of the box.  Adding &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x20a2e7d0&quot;&gt;ODBC&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1e4cceb8&quot;&gt;JDBC&lt;/a&gt; clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.&lt;/p&gt;

&lt;p&gt;Now for the difficult side:&lt;/p&gt;

&lt;p&gt;Use SQL and SPARQL; use stored procedures whenever there is server-side business logic. For some time-critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to, whether &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x20b03f08&quot;&gt;PHP&lt;/a&gt; or Java or anything else. For web services, simple is best. Stick to basics. &amp;quot;The engineer is one who can invent a simple thing.&amp;quot; Use SQL statements rather than the admin UI.&lt;/p&gt;

&lt;p&gt;Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, as produced by the installers, is cluttered.&lt;/p&gt;

&lt;p&gt;We should put this into a couple of use case oriented how-tos.&lt;/p&gt;

&lt;p&gt;Also, we should create a network of &amp;quot;friendly local Virtuoso geeks&amp;quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.&lt;/p&gt;

&lt;h3&gt;&amp;quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.&lt;/p&gt;

&lt;p&gt;If one really wants to do one&amp;#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.&lt;/p&gt;

&lt;p&gt;We are talking about such things with different parties at present.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How webby are you?  What is webby?&amp;quot;&lt;/h3&gt;

&lt;p&gt;&amp;quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&amp;quot;&lt;/p&gt;

&lt;p&gt;We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant &lt;i&gt;when&lt;/i&gt; it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.&lt;/p&gt;

&lt;p&gt;Google-style crawling of everything becomes less practical if one needs to run complex &lt;i&gt;ad hoc&lt;/i&gt; queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How does OpenLink see the next five years unfolding?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Personally, I think we have the basics for the birth of a new inflection in the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x2018bd98&quot;&gt;knowledge&lt;/a&gt; economy. The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1ec110d8&quot;&gt;URI&lt;/a&gt; is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1ecfba08&quot;&gt;information&lt;/a&gt; can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.&lt;/p&gt;

&lt;p&gt;The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is &lt;i&gt;context&lt;/i&gt;. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.&lt;/p&gt;

&lt;p&gt;Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies, dereferenceability, and discoverability are important. Without these, we are at best dealing with closed systems, even if they are smart. The expert systems of the 1980s are a case in point.&lt;/p&gt;

&lt;p&gt;Ever since the .com era, the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id0x1c4c9248&quot;&gt;URL&lt;/a&gt; has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.&lt;/p&gt;

&lt;p&gt;With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.&lt;/p&gt;

&lt;p&gt;For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x20791bf0&quot;&gt;Linked Open Data&lt;/a&gt; cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.&lt;/p&gt;

&lt;p&gt;This makes some things possible that were hard thus far.&lt;/p&gt;

&lt;p&gt;On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.&lt;/p&gt;

&lt;p&gt;Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x1f27e6d8&quot;&gt;Sponger&lt;/a&gt;, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.&lt;/p&gt;

&lt;p&gt;Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.&lt;/p&gt;

&lt;p&gt;Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event.  Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.&lt;/p&gt;</description></item><item><title>State of the Semantic Web, Part 2 - The Technical Questions (updated)</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1464</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1464#comments</comments><pubDate>Sun, 26 Oct 2008 12:02:43 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-27T11:28:01-04:00</n0:modified><description>&lt;p&gt;Here I will talk about some more technical questions that came up.  This is mostly general; &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1f53d1a0&quot;&gt;Virtuoso&lt;/a&gt; specific questions and answers are separate.
&lt;/p&gt;

&lt;h3&gt;&amp;quot;How to Bootstrap?  Where will the triples come from?&amp;quot;&lt;/h3&gt;

&lt;p&gt;There are already wrappers producing &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1beda278&quot;&gt;RDF&lt;/a&gt; from many applications. Since any structured or semi-structured &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1e57c648&quot;&gt;data&lt;/a&gt; can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data &lt;i&gt;per se&lt;/i&gt; is not the concern.&lt;/p&gt;

&lt;p&gt;The triples may come from any application or database, but they will not come from the end user directly.  There was a good talk about photograph annotation in &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x2028b7e8&quot;&gt;Vienna&lt;/a&gt;, describing many ways of deriving metadata for photos.  The essential wisdom is annotating on the spot and wherever possible doing so automatically.  The consumer is very unlikely to go annotate  photos after the fact.  Further, one can infer that photos made with the same camera around the same time are from the same location.  There are other such heuristics.  In this use case, the end user does not need to see triples.  There is some benefit though in using commonly used geographical terminology for linking to other data sources.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How will one develop applications?&amp;quot;&lt;/h3&gt;

&lt;p&gt;I&amp;#39;d say one will develop them much the same way as thus far.  In &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x1eff1748&quot;&gt;PHP&lt;/a&gt;, for example.  Whether one&amp;#39;s query language is &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1d83dff8&quot;&gt;SPARQL&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1e9f4e88&quot;&gt;SQL&lt;/a&gt; does not make a large difference in how basic web UI is made.&lt;/p&gt;

&lt;p&gt;A SPARQL end-point is no more an end-user item than a SQL command-line is.&lt;/p&gt;

&lt;p&gt;A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure.  The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.&lt;/p&gt;

&lt;p&gt;The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly.  Even I, at OpenLink, after supposedly having been around the block a couple of times, made some experiments around the topic.  What does make sense is putting a thin wrapper of HTML around the application, using XSLT and such for formatting.  Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms.  For this, Ajax technologies like &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1d780520&quot;&gt;OAT&lt;/a&gt; (the &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x20943788&quot;&gt;OpenLink AJAX Toolkit&lt;/a&gt;) will be good.&lt;/p&gt;

&lt;p&gt;The UI ought not to completely hide the URIs of the data from the user.  It should offer a drill down to faceted views of the triples for example.  Remember when Xerox talked about graphical user interfaces in 1980? &amp;quot;Don&amp;#39;t mode me in&amp;quot; was the slogan, as I recall.&lt;/p&gt;

&lt;p&gt;Since then, we have vacillated between modal and non-modal interaction models.  Repetitive workflows like order entry go best modally and are anyway being replaced by web services.  Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example.  But enabling the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1e14eb88&quot;&gt;knowledge&lt;/a&gt; worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs.  Face it: the world is not getting simpler.  It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.&lt;/p&gt;

&lt;p&gt;For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested.  For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1e91d118&quot;&gt;linked data&lt;/a&gt;.  It makes for more flexible querying.  Linked data does not &lt;i&gt;per se&lt;/i&gt; mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.&lt;/p&gt;

&lt;p&gt;For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority.  Thus the user interfaces in places where RDF will have real edge must be more capable of &lt;i&gt;ad hoc&lt;/i&gt; viewing and navigation than regular real-time or line of business user interfaces.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1c7f8ee0&quot;&gt;OpenLink Data Explorer&lt;/a&gt; idea of a &amp;quot;data behind the web page&amp;quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view.  There are and will be separate clarifications and demos about this.&lt;/p&gt;

&lt;h3&gt;&amp;quot;What of the proliferation of standards?  Does this not look too tangled, no clear identity?  How would one know where to begin?&amp;quot;&lt;/h3&gt;

&lt;p&gt;When &lt;a href=&quot;http://www.w3.org/2001/sw/sweo/&quot; id=&quot;link-id0x1d73c268&quot;&gt;SWEO&lt;/a&gt; was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure.  Just the right amount of structure for the application, less rigid and easier to change than relational.&lt;/p&gt;

&lt;p&gt;Do not even mention the different serialization formats.  Just say that it fits on top of the accepted web infrastructure: &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x1efefed0&quot;&gt;HTTP&lt;/a&gt;, URIs, and &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x1af89b18&quot;&gt;XML&lt;/a&gt; where desired.&lt;/p&gt;

&lt;p&gt;It is misleading to say inference is a box at some specific place in the diagram.  Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step.  Since there is structure and semantics, inference is possible if desired.&lt;/p&gt;

&lt;h3&gt;&amp;quot;Can I make a social network application in RDF only, with no &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1cb62cd8&quot;&gt;RDBMS&lt;/a&gt;?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Yes, in principle, but what do you have in mind?  The answer is very context dependent.  The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc.  In such a case, RDF is a great match, especially since the user count will not be in the millions.  No university has that many students and anyway they do not hang out online browsing the course catalogue.&lt;/p&gt;

&lt;p&gt;On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with an RDBMS.&lt;/p&gt;
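
&lt;p&gt;As a rough illustration of the login example, assuming a hypothetical &lt;code&gt;&amp;quot;account&amp;quot;&lt;/code&gt; table (the table and column names are invented here, not taken from any real application), the relational bookkeeping in plain standard SQL is one in-place row update:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;-- Hypothetical table, for illustration only.
CREATE TABLE &amp;quot;account&amp;quot; 
   (&amp;quot;id&amp;quot; INT PRIMARY KEY, 
    &amp;quot;login_count&amp;quot; INT, 
    &amp;quot;last_login&amp;quot; TIMESTAMP
   );

-- Recording a login touches one row, found through the primary key.
UPDATE &amp;quot;account&amp;quot; 
   SET &amp;quot;login_count&amp;quot; = &amp;quot;login_count&amp;quot; + 1, 
       &amp;quot;last_login&amp;quot; = CURRENT_TIMESTAMP 
   WHERE &amp;quot;id&amp;quot; = 1;&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;With triples as the only model, the same change would typically mean removing the old count and timestamp statements and adding new ones, each with its own index maintenance.&lt;/p&gt;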

&lt;p&gt;If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical.  This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style.  This is a matter of cost more than of feasibility.  Ten servers cost less than a hundred and have failures ten times less frequently.&lt;/p&gt;

&lt;p&gt;In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.&lt;/p&gt;</description></item><item><title>Virtuoso Update, Billion Triples and Outlook</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1448#comments</comments><pubDate>Thu, 02 Oct 2008 09:31:17 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-02T12:47:02.000002-04:00</n0:modified><description>&lt;p&gt;I will say a few things about what we have been doing and where we can go.&lt;/p&gt;

&lt;p&gt;Firstly, we have a fairly scalable platform with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xa412e450&quot;&gt;Virtuoso&lt;/a&gt; 6 Cluster. It was most recently tested with the workload discussed in the previous &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445&quot; id=&quot;link-id1638a5b8&quot;&gt;Billion Triples post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is an updated version of &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf&quot; id=&quot;link-id16280a68&quot;&gt;the paper about this&lt;/a&gt;. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.&lt;/p&gt;

&lt;p&gt;Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1c1c5f48&quot;&gt;SQL&lt;/a&gt; optimizations specific to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1bcb6108&quot;&gt;RDF&lt;/a&gt;. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.&lt;/p&gt;

&lt;p&gt;We spent a lot of time around the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x3a4e17c8&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; story, so we got to the more advanced stuff like the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x1a66c568&quot;&gt;Billion Triples Challenge&lt;/a&gt; rather late. We did along the way also run &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x188c2608&quot;&gt;BSBM&lt;/a&gt; with an &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x1aa97f98&quot;&gt;Oracle&lt;/a&gt; back-end, with Virtuoso mapping &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1abd87a0&quot;&gt;SPARQL&lt;/a&gt; to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.&lt;/p&gt;

&lt;p&gt;RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id0x1a941678&quot;&gt;RDB2RDF XG&lt;/a&gt;. Examples of complex warehouses include the &lt;a href=&quot;http://neurocommons.org/page/Main_Page&quot; id=&quot;link-id0x1aa5a9f8&quot;&gt;Neurocommons&lt;/a&gt; database, the Billion Triples Challenge, and the &lt;a href=&quot;http://www.garlik.com/&quot; id=&quot;link-id0x372df7b0&quot;&gt;Garlik DataPatrol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1a671cf0&quot;&gt;Linked Data&lt;/a&gt; forum. BSBM&amp;#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1ab83dd0&quot;&gt;data&lt;/a&gt; web becomes as indispensable as presence on the HTML web.&lt;/p&gt;

&lt;p&gt;I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.&lt;/p&gt;

&lt;p&gt;Now, all the things shown in the &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445&quot; id=&quot;link-id14de1d18&quot;&gt;Billion Triples post&lt;/a&gt; can be done with a relational system specially built for each purpose. Since we are a general purpose &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x340d3470&quot;&gt;RDBMS&lt;/a&gt;, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.&lt;/p&gt;

&lt;p&gt;Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.&lt;/p&gt;
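
&lt;p&gt;A minimal sketch of such a purpose-built structure, assuming a hypothetical &lt;code&gt;&amp;quot;tag_pair&amp;quot;&lt;/code&gt; table in which tags are identified by integers (the names and the tag number 42 are illustrative only):&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;-- Hypothetical table: one row per pair of tags seen together, with a count.
CREATE TABLE &amp;quot;tag_pair&amp;quot; 
   (&amp;quot;tag1&amp;quot; INT, 
    &amp;quot;tag2&amp;quot; INT, 
    &amp;quot;cnt&amp;quot; INT, 
    PRIMARY KEY (&amp;quot;tag1&amp;quot;, &amp;quot;tag2&amp;quot;)
   );

-- Tags that occur together with tag 42, most frequent first.
-- The primary key keeps all rows for one tag1 adjacent.
SELECT &amp;quot;tag2&amp;quot;, 
       &amp;quot;cnt&amp;quot; 
   FROM &amp;quot;tag_pair&amp;quot; 
   WHERE &amp;quot;tag1&amp;quot; = 42 
   ORDER BY &amp;quot;cnt&amp;quot; DESC;&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Each statistic is one compact row here, whereas a blank-node representation would need several triples per pair.&lt;/p&gt;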

&lt;p&gt;The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &amp;quot;same as&amp;quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.&lt;/p&gt;

&lt;p&gt;We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of &lt;a href=&quot;http://umbel.org/about/&quot; id=&quot;link-id0x19cabf38&quot;&gt;UMBEL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Cyc&quot; id=&quot;link-id0x19cecd10&quot;&gt;OpenCyc&lt;/a&gt;. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.&lt;/p&gt;

&lt;p&gt;We expect to be able to combine geography, social proximity, subject matter, and &lt;a href=&quot;http://dbpedia.org/resource/Named_entity_recognition&quot; id=&quot;link-id0x1a8202e8&quot;&gt;named entities&lt;/a&gt;, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.&lt;/p&gt;

&lt;p&gt;We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.&lt;/p&gt;

&lt;p&gt;Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.&lt;/p&gt;

&lt;p&gt;The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1ab0a918&quot;&gt;information&lt;/a&gt; and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.&lt;/p&gt;

&lt;p&gt;Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like &lt;i&gt;list the top 10 attributes with the most distinct values for all persons&lt;/i&gt; cannot be done in SQL. SQL simply does not allow the columns to be variable.&lt;/p&gt;
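
&lt;p&gt;The difference is where the attribute names live. In a conventional schema they are column names fixed in the query text; in RDF they are data. As a sketch only, assuming a hypothetical generic &lt;code&gt;&amp;quot;triples&amp;quot;&lt;/code&gt; table that holds one statement per row (this is not Virtuoso&amp;#39;s actual storage layout, and the type names are placeholders), the question becomes an ordinary aggregate once the property is a value:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;-- Hypothetical table: one row per (subject, property, value) statement.
CREATE TABLE &amp;quot;triples&amp;quot; 
   (&amp;quot;s&amp;quot; VARCHAR (200), 
    &amp;quot;p&amp;quot; VARCHAR (200), 
    &amp;quot;o&amp;quot; VARCHAR (200)
   );

-- The ten properties with the most distinct values,
-- restricted to subjects that are typed as persons.
SELECT &amp;quot;t&amp;quot;.&amp;quot;p&amp;quot;, 
       COUNT (DISTINCT &amp;quot;t&amp;quot;.&amp;quot;o&amp;quot;) AS &amp;quot;distinct_values&amp;quot; 
   FROM &amp;quot;triples&amp;quot; &amp;quot;t&amp;quot; 
   WHERE &amp;quot;t&amp;quot;.&amp;quot;s&amp;quot; IN (SELECT &amp;quot;s&amp;quot; 
                        FROM &amp;quot;triples&amp;quot; 
                        WHERE &amp;quot;p&amp;quot; = &amp;#39;rdf:type&amp;#39; 
                          AND &amp;quot;o&amp;quot; = &amp;#39;foaf:Person&amp;#39;) 
   GROUP BY &amp;quot;t&amp;quot;.&amp;quot;p&amp;quot; 
   ORDER BY &amp;quot;distinct_values&amp;quot; DESC 
   FETCH FIRST 10 ROWS ONLY;&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Over a table with one column per attribute, the same question would need a separate expression for every column, written out ahead of time.&lt;/p&gt;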

&lt;p&gt;Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.&lt;/p&gt;

&lt;p&gt;The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.&lt;/p&gt;

&lt;p&gt;Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.&lt;/p&gt;</description></item><item><title>Transitivity and Graphs for SQL</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-09-08#1433</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1433#comments</comments><pubDate>Mon, 08 Sep 2008 09:20:11 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-09-08T15:43:04.000006-04:00</n0:modified><description>&lt;h2&gt;Background&lt;/h2&gt; 

&lt;p&gt;I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xb1fe830&quot;&gt;SQL&lt;/a&gt; query language.&lt;/p&gt;

&lt;p&gt;The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example.  It is all about extracting the common features of applications and making these the features of a platform instead.&lt;/p&gt;

&lt;p&gt;It is now time to apply this principle to graph traversal.&lt;/p&gt;

&lt;p&gt;The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1cb37218&quot;&gt;data&lt;/a&gt; structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.&lt;/p&gt;

&lt;p&gt;The ad-hoc nature and very large volume of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1e1850a0&quot;&gt;RDF&lt;/a&gt; data make this a timely question.  Up until now, the answer to this question has been to materialize any implied facts in RDF stores.  If &lt;i&gt;a&lt;/i&gt; was part of &lt;i&gt;b&lt;/i&gt;, and &lt;i&gt;b&lt;/i&gt; part of &lt;i&gt;c&lt;/i&gt;, the implied fact that &lt;i&gt;a&lt;/i&gt; is part of &lt;i&gt;c&lt;/i&gt; would be inserted explicitly into the database as a pre-query step.&lt;/p&gt;
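
&lt;p&gt;In relational terms, such a materialization step could look roughly like the following, assuming a hypothetical &lt;code&gt;&amp;quot;part_of&amp;quot;&lt;/code&gt; table of direct facts (the table is invented for this illustration). The statement is repeated until it inserts no more rows:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;-- Hypothetical table of part-of facts, identified by integers.
CREATE TABLE &amp;quot;part_of&amp;quot; 
   (&amp;quot;part&amp;quot; INT, 
    &amp;quot;whole&amp;quot; INT, 
    PRIMARY KEY (&amp;quot;part&amp;quot;, &amp;quot;whole&amp;quot;)
   );

-- One round: if x is part of y and y is part of z, record x as part of z,
-- unless that fact is already stored.
INSERT INTO &amp;quot;part_of&amp;quot; (&amp;quot;part&amp;quot;, &amp;quot;whole&amp;quot;) 
   SELECT DISTINCT 
          &amp;quot;x&amp;quot;.&amp;quot;part&amp;quot;, 
          &amp;quot;y&amp;quot;.&amp;quot;whole&amp;quot; 
      FROM &amp;quot;part_of&amp;quot; &amp;quot;x&amp;quot; 
         JOIN &amp;quot;part_of&amp;quot; &amp;quot;y&amp;quot; 
            ON &amp;quot;x&amp;quot;.&amp;quot;whole&amp;quot; = &amp;quot;y&amp;quot;.&amp;quot;part&amp;quot; 
      WHERE NOT EXISTS (SELECT 1 
                           FROM &amp;quot;part_of&amp;quot; &amp;quot;z&amp;quot; 
                           WHERE &amp;quot;z&amp;quot;.&amp;quot;part&amp;quot; = &amp;quot;x&amp;quot;.&amp;quot;part&amp;quot; 
                             AND &amp;quot;z&amp;quot;.&amp;quot;whole&amp;quot; = &amp;quot;y&amp;quot;.&amp;quot;whole&amp;quot;);&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;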

&lt;p&gt;This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query.  The activity becomes less ad-hoc.&lt;/p&gt;

&lt;p&gt;Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed.  This is why with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xa51bd10&quot;&gt;Virtuoso&lt;/a&gt; we have tended to favor inference on demand (&amp;quot;backward chaining&amp;quot;) and mapping of relational data into RDF without copying.&lt;/p&gt;

&lt;p&gt;The SQL world has taken steps towards dealing with recursion with the &lt;code&gt;WITH - UNION&lt;/code&gt; construct which allows definition of recursive views.  The idea there is to define, for example, a tree walk as a &lt;code&gt;UNION&lt;/code&gt; of the data of the starting node plus the recursive walk of the starting node&amp;#39;s immediate children.&lt;/p&gt;
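
&lt;p&gt;For reference, a tree walk in this style might be written as below. This is a sketch in standard SQL over a hypothetical &lt;code&gt;&amp;quot;node&amp;quot;&lt;/code&gt; table invented for the example; some products omit the &lt;code&gt;RECURSIVE&lt;/code&gt; keyword.&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;-- Hypothetical tree table: each row names a node and its parent.
CREATE TABLE &amp;quot;node&amp;quot; 
   (&amp;quot;id&amp;quot; INT PRIMARY KEY, 
    &amp;quot;parent&amp;quot; INT
   );

WITH RECURSIVE &amp;quot;walk&amp;quot; (&amp;quot;id&amp;quot;) AS 
   (SELECT &amp;quot;id&amp;quot;                        -- the starting node
       FROM &amp;quot;node&amp;quot; 
       WHERE &amp;quot;id&amp;quot; = 1 
    UNION ALL 
    SELECT &amp;quot;n&amp;quot;.&amp;quot;id&amp;quot;                  -- children of nodes already reached
       FROM &amp;quot;node&amp;quot; &amp;quot;n&amp;quot; 
          JOIN &amp;quot;walk&amp;quot; &amp;quot;w&amp;quot; 
             ON &amp;quot;n&amp;quot;.&amp;quot;parent&amp;quot; = &amp;quot;w&amp;quot;.&amp;quot;id&amp;quot;
   ) 
SELECT &amp;quot;id&amp;quot; 
   FROM &amp;quot;walk&amp;quot;;&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Note that the direction of travel, from parent to child, is fixed by how the recursive branch is written.&lt;/p&gt;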

&lt;p&gt;The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving &lt;code&gt;JOIN&lt;/code&gt;s between such recursive views.  This model of recursion seems to lose SQL&amp;#39;s non-procedural nature.  One can no longer easily rearrange &lt;code&gt;JOIN&lt;/code&gt;s based on what data is given and what is to be retrieved.  If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root.  At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.&lt;/p&gt;

&lt;p&gt;Take a question like &amp;quot;list the parts of products of category &lt;i&gt;C&lt;/i&gt; which have materials that are classified as toxic.&amp;quot;  Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &amp;quot;toxic&amp;quot; has a multilevel substructure.&lt;/p&gt;

&lt;p&gt;Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category.  One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; this is regular cost-based optimization.&lt;/p&gt;

&lt;p&gt;Especially with RDF, there are many problems of this type.  In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.&lt;/p&gt;

&lt;p&gt;In Virtuoso, we see &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xb4b3ce8&quot;&gt;SPARQL&lt;/a&gt; as reducing to SQL.  Any RDF-oriented database-engine or query-optimization feature is accessed via SQL.  Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, &lt;i&gt;ipso facto&lt;/i&gt;, an SQL feature.  Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.&lt;/p&gt;

&lt;h2&gt; SQL and Transitivity &lt;/h2&gt;

&lt;p&gt;We will here look at some simple social network queries.  A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., &lt;code&gt;SELECT&lt;/code&gt; in another &lt;code&gt;SELECT&lt;/code&gt;&amp;#39;s &lt;code&gt;FROM&lt;/code&gt; clause, with a &lt;code&gt;TRANSITIVE&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;Consider the data:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;CREATE TABLE &amp;quot;knows&amp;quot; 
   (&amp;quot;p1&amp;quot; INT, 
    &amp;quot;p2&amp;quot; INT, 
    PRIMARY KEY (&amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot;)
   );
ALTER INDEX &amp;quot;knows&amp;quot; 
   ON &amp;quot;knows&amp;quot; 
   PARTITION (&amp;quot;p1&amp;quot; INT);
CREATE INDEX &amp;quot;knows2&amp;quot; 
   ON &amp;quot;knows&amp;quot; (&amp;quot;p2&amp;quot;, &amp;quot;p1&amp;quot;) 
   PARTITION (&amp;quot;p2&amp;quot; INT);
&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We represent a social network with the many-to-many relation &amp;quot;knows&amp;quot;.  The persons are identified by integers.&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;INSERT INTO &amp;quot;knows&amp;quot; VALUES (1, 2);
INSERT INTO &amp;quot;knows&amp;quot; VALUES (1, 3);
INSERT INTO &amp;quot;knows&amp;quot; VALUES (2, 4);&lt;/code&gt;
 &lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot; 
         FROM &amp;quot;knows&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;k&amp;quot;.&amp;quot;p1&amp;quot; = 1;&lt;/code&gt;&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We obtain the result:&lt;/p&gt;

&lt;blockquote&gt;
&lt;table width=&quot;100&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;The operation is reversible:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot; 
         FROM &amp;quot;knows&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot; = 4;
&lt;/code&gt;
 &lt;/pre&gt;

&lt;table width=&quot;100&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since we now give &lt;i&gt;p2&lt;/i&gt;, we traverse from &lt;i&gt;p2&lt;/i&gt; towards &lt;i&gt;p1&lt;/i&gt;. The result set states that 4 is known by 2 directly and, since 2 is known by 1, by 1 transitively.&lt;/p&gt;
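&lt;p&gt;To make the semantics concrete, here is a minimal Python sketch of what the two queries compute over the three sample rows. It is only an illustration of transitive traversal in either direction, not a description of how Virtuoso evaluates TRANSITIVE; the function name &lt;code&gt;reachable&lt;/code&gt; and its &lt;code&gt;reverse&lt;/code&gt; flag are made up for this example, and leaving the start node out of the result is an assumption.&lt;/p&gt;

```python
from collections import deque

# The three rows inserted into "knows", as (p1, p2) pairs.
KNOWS = [(1, 2), (1, 3), (2, 4)]


def reachable(edges, start, reverse=False):
    """Return every node reachable from start, in breadth-first order.

    With reverse=True the edges are followed from p2 towards p1,
    which mirrors fixing p2 instead of p1 in the SQL query.
    The start node itself is left out of the result (an assumption).
    """
    neighbours = {}
    for p1, p2 in edges:
        src, dst = (p2, p1) if reverse else (p1, p2)
        neighbours.setdefault(src, []).append(dst)
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt != start and nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


forward = reachable(KNOWS, 1)                 # everyone 1 knows, transitively
backward = reachable(KNOWS, 4, reverse=True)  # everyone who knows 4, transitively
print(forward, backward)
```

&lt;p&gt;The order of the rows here is simply breadth-first order and need not match the order in which the database returns them.&lt;/p&gt;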

&lt;p&gt;To see what would happen if &lt;i&gt;x&lt;/i&gt; knowing &lt;i&gt;y&lt;/i&gt; also meant &lt;i&gt;y&lt;/i&gt; knowing &lt;i&gt;x&lt;/i&gt;, one could write:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot; 
	    FROM (SELECT 
                  &amp;quot;p1&amp;quot;, 
                  &amp;quot;p2&amp;quot; 
               FROM &amp;quot;knows&amp;quot; 
               UNION ALL 
                  SELECT 
                     &amp;quot;p2&amp;quot;, 
                     &amp;quot;p1&amp;quot; 
                  FROM &amp;quot;knows&amp;quot;
              ) &amp;quot;k2&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot; = 4;&lt;/code&gt;
 &lt;/pre&gt;

&lt;table width=&quot;100&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;


&lt;p&gt;Now, since we know that 1 and 4 are related, we can ask how they are related.&lt;/p&gt;
&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot;, 
            T_STEP (1) AS &amp;quot;via&amp;quot;, 
            T_STEP (&amp;#39;step_no&amp;#39;) AS &amp;quot;step&amp;quot;, 
            T_STEP (&amp;#39;path_id&amp;#39;) AS &amp;quot;path&amp;quot; 
         FROM &amp;quot;knows&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;p1&amp;quot; = 1 
      AND &amp;quot;p2&amp;quot; = 4;&lt;/code&gt;
 &lt;/pre&gt;

&lt;table width=&quot;250&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;via&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;step&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;path&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;


&lt;p&gt;The first two columns are the ends of the path.  The next column is the person at a given step on the path.  The one after that is the number of the step, counting from 0, so that the end of the path that corresponds to the condition on the column designated as input, i.e., &lt;i&gt;p1&lt;/i&gt;, has number 0.  Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.&lt;/p&gt;
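&lt;p&gt;The shape of these rows can be reproduced with a short Python sketch. This is an illustration under simplifying assumptions, not Virtuoso&#39;s algorithm: the hypothetical helper &lt;code&gt;path_rows&lt;/code&gt; reports only one shortest path, so the path number is always 0.&lt;/p&gt;

```python
from collections import deque

# The three rows inserted into "knows", as (p1, p2) pairs.
KNOWS = [(1, 2), (1, 3), (2, 4)]


def path_rows(edges, start, goal):
    """Return (p1, p2, via, step, path) rows for one shortest path.

    Simplification: only a single path is reported, so the path
    number in the last column is always 0.
    """
    neighbours = {}
    for p1, p2 in edges:
        neighbours.setdefault(p1, []).append(p2)
    # Breadth-first search that remembers how each node was reached.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in neighbours.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if goal not in parent:
        return []
    # Walk back from the goal to the start, then reverse the list.
    steps = []
    node = goal
    while node is not None:
        steps.append(node)
        node = parent[node]
    steps.reverse()
    # Step 0 is the end given as input (p1), as in the text above.
    return [(start, goal, via, step, 0) for step, via in enumerate(steps)]


rows = path_rows(KNOWS, 1, 4)
for row in rows:
    print(row)
```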

&lt;p&gt;For LinkedIn users, the query that lists friends ordered by distance and then by descending friend count, which underlies most LinkedIn search result views, can be written as:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT &quot;p2&quot;, 
      &quot;dist&quot;, 
      (SELECT 
          COUNT (*) 
          FROM &quot;knows&quot; &quot;c&quot; 
          WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot;
      ) 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;dist&quot;
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
   ORDER BY &quot;dist&quot;, 3 DESC;&lt;/code&gt;
 &lt;/pre&gt;


&lt;table width=&quot;150&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;dist&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;aggregate&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;


&lt;h2&gt;How?&lt;/h2&gt;

&lt;p&gt;The queries shown above work on Virtuoso v6.  When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant.  By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.&lt;/p&gt;

&lt;p&gt;Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection.  Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.&lt;/p&gt;
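&lt;p&gt;A minimal Python sketch of the meet-in-the-middle idea follows. It is not Virtuoso&#39;s optimizer: the function names are invented for illustration, and the graph is treated as undirected, which is an assumption made to keep the example short. The two sides are expanded alternately, so with a limit of 6 steps each end grows by at most 3 levels.&lt;/p&gt;

```python
def expand(frontier, neighbours, seen):
    """Advance one breadth-first level and return the new frontier."""
    nxt = set()
    for node in frontier:
        for other in neighbours.get(node, ()):
            if other not in seen:
                seen.add(other)
                nxt.add(other)
    return nxt


def connected_within(edges, a, b, max_steps=6):
    """Return the distance between a and b if it is at most max_steps, else None.

    Assumption: edges are undirected. The search grows from both ends
    and stops at the first intersection of the two visited sets.
    """
    neighbours = {}
    for x, y in edges:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    if a == b:
        return 0
    seen_a, seen_b = {a}, {b}
    front_a, front_b = {a}, {b}
    for step in range(1, max_steps + 1):
        # Alternate sides so each end is expanded about max_steps / 2 levels.
        if step % 2 == 1:
            front_a = expand(front_a, neighbours, seen_a)
        else:
            front_b = expand(front_b, neighbours, seen_b)
        if seen_a.intersection(seen_b):
            return step
    return None


# The arithmetic from the paragraph above, for an average fan-out of 30.
both_ends = 2 * 30 ** 3   # three levels from each end
one_end = 30 ** 6         # six levels from one end
print(both_ends, one_end)

sample = [(1, 2), (1, 3), (2, 4)]
print(connected_within(sample, 3, 4))
```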

&lt;p&gt;Writing a generic database-driven graph traversal framework on the application side, say in Java over &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0xb595050&quot;&gt;JDBC&lt;/a&gt;, would easily take over a thousand lines of code. This is much more work than can be justified for a one-off, ad hoc query.  Besides, the DBMS could not optimize the traversal order in such a case.&lt;/p&gt;

&lt;h2&gt;Next&lt;/h2&gt; 

&lt;p&gt;In a future &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x1e4d4f18&quot;&gt;blog&lt;/a&gt; post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc.  There are lots of switches for controlling different parameters of the traversal.  This is just the beginning.  I will also give examples of the use of this in SPARQL.&lt;/p&gt;</description></item><item><title>Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-07-17#1392</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1392#comments</comments><pubDate>Thu, 17 Jul 2008 17:16:19 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-07-17T15:28:20-04:00</n0:modified><description>&lt;h2&gt;Improvements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
  &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfnativestorageproviders.html&quot; id=&quot;link-id13e54d98&quot;&gt;Full operation&lt;/a&gt; with &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x11839970&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x118521a0&quot;&gt;Sesame&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x11e14758&quot;&gt;RDF&lt;/a&gt; Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites.&lt;/li&gt;
&lt;li&gt;Better support for alternate RDF indexing schemes&lt;/li&gt;
&lt;li&gt;Parallel operation of the RDF Sponger, importing multiple
sources concurrently.&lt;/li&gt;
&lt;li&gt;New &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x13661868&quot;&gt;data&lt;/a&gt; formats supported for on-demand RDF-ization in the
Sponger&lt;/li&gt;
&lt;li&gt;More efficient support for inference of subclass and
sub-property; now capable of efficiently handling taxonomies of tens
of thousands of classes&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x9df079b8&quot;&gt;OWL&lt;/a&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfsparqlrule.html#rdfsparqlruleintro&quot; id=&quot;link-id104d58d8&quot;&gt;equivalentClass and equivalentProperty&lt;/a&gt; support.&lt;/li&gt;
&lt;li&gt;
    &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfdatarepresentation.html#rdfdynamiclocal&quot; id=&quot;link-id109606a8&quot;&gt;Dynamic IRI host part&lt;/a&gt; support for mapped data and for metadata of local resources. After the host is renamed, or when multiple virtual hosts are used, URIs with any valid host part are accepted and refer to the same thing, with no duplicate storage required.&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x110e7688&quot;&gt;SPARQL&lt;/a&gt; optimizations for &lt;code&gt;LIMIT&lt;/code&gt; and &lt;code&gt;OFFSET&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Documentation&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
    &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/perfdiag.html#perfdiagqueryplans&quot; id=&quot;link-id10a56dd0&quot;&gt;How to read query plans and how to use the key performance meters&lt;/a&gt;
  &lt;/li&gt;
&lt;li&gt;
    &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfperformancetuning.html#rdfperfcost&quot; id=&quot;link-id106cb5c0&quot;&gt;How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case&lt;/a&gt;
  &lt;/li&gt;
&lt;li&gt;How to debug RDF views
&lt;ul&gt;
  &lt;li&gt;
    &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/sparqldebug.html&quot; id=&quot;link-id133b4420&quot;&gt;Better documentation of SPARQL extensions and options&lt;/a&gt;
  &lt;/li&gt;
&lt;li&gt;
    &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfviews.html#rdfviewnorthwindexample1&quot; id=&quot;link-id1060fdd8&quot;&gt;A sample of correct RDF view usage with the Northwind demo data&lt;/a&gt;
  &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Bug Fixes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Generally improved safety of built-in functions, better
argument checking.&lt;/li&gt;
&lt;li&gt;Verified UTF-8 international character support in all RDF use
cases, &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x11140c28&quot;&gt;SQL&lt;/a&gt; client/&lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x110947e8&quot;&gt;SPARQL protocol&lt;/a&gt;/all data formats.&lt;/li&gt;
&lt;/ul&gt;
</description></item><item><title>The DARQ Matter of Federation</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1376</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1376#comments</comments><pubDate>Mon, 09 Jun 2008 13:57:30 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-06-11T15:15:00-04:00</n0:modified><description>&lt;p&gt;Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &amp;quot;dark matter&amp;quot; spread in interstellar and intergalactic space.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x19bbd830&quot;&gt;data&lt;/a&gt; web, for its part, will be held together by federation, also an invisible factor. As in Minkowski space, so in &lt;a href=&quot;http://dbpedia.org/resource/Cyberspace&quot; id=&quot;link-id0x19af2488&quot;&gt;cyberspace&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.&lt;/p&gt;
&lt;p&gt;
&lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id0x19b7a9c8&quot;&gt;DARQ&lt;/a&gt; is Bastian Quilitz&amp;#39;s federated extension of the &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x19ce3da0&quot;&gt;Jena&lt;/a&gt; &lt;a href=&quot;http://jena.sourceforge.net/ARQ/&quot; id=&quot;link-id0xa569a258&quot;&gt;ARQ&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1a8d2270&quot;&gt;SPARQL&lt;/a&gt; processor. It has existed for a while and was also presented at &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x1aad1d00&quot;&gt;ESWC2008&lt;/a&gt;. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.&lt;/p&gt;
&lt;p&gt;Bastian had split &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x1a8ac770&quot;&gt;DBpedia&lt;/a&gt; among five &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x19601d30&quot;&gt;Virtuoso&lt;/a&gt; servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.&lt;/p&gt;
&lt;p&gt;Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.&lt;/p&gt;
&lt;p&gt;Since we are convinced of the cause, let&amp;#39;s talk about the means.&lt;/p&gt;
&lt;p&gt;For DARQ as it now stands, there&amp;#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x19b94818&quot;&gt;HTTP&lt;/a&gt;/1.1 message. So, if the query is &amp;quot;get me my friends living in cities of over a million people,&amp;quot; there will be the fragment &amp;quot;get city where x lives&amp;quot; and later &amp;quot;ask if population of x greater than 1000000&amp;quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server.&lt;/p&gt;
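&lt;p&gt;As a rough illustration of the batching idea, the Python sketch below collects the bindings for a non-outermost join step and builds one request for the whole batch instead of one per binding. Note the substitution: it uses a SPARQL 1.1 &lt;code&gt;VALUES&lt;/code&gt; clause, which postdates this post, in place of packing several requests into one HTTP/1.1 message. The &lt;code&gt;ex:&lt;/code&gt; names, the &lt;code&gt;ex:livesIn&lt;/code&gt; predicate and both helper functions are hypothetical, and prefix declarations are omitted for brevity.&lt;/p&gt;

```python
def batched(items, size):
    """Split items into consecutive lists of at most size elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def city_query(friends):
    """Build one query that asks for the city of every friend in the batch.

    The ex:livesIn predicate is a made-up name for this example.
    """
    return (
        "SELECT ?x ?city WHERE { "
        "VALUES ?x { " + " ".join(friends) + " } "
        "?x ex:livesIn ?city }"
    )


# 100 friends produce a single request rather than 100 round trips.
friends = ["ex:friend%d" % i for i in range(100)]
queries = [city_query(batch) for batch in batched(friends, 100)]
print(len(queries))
```

&lt;p&gt;The batch size is a tuning knob: a smaller value trades more round trips for shorter requests.&lt;/p&gt;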
&lt;p&gt;Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x17822258&quot;&gt;JDBC&lt;/a&gt; so Bastian can try this if interested.&lt;/p&gt;
&lt;p&gt;These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.&lt;/p&gt;
&lt;p&gt;When federating &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1a651628&quot;&gt;SQL&lt;/a&gt;, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:&lt;/p&gt;
&lt;p&gt;If a foaf:Person is found on a given server, this does not mean that the Person&amp;#39;s geek code or email hash will be on the same server. Thus &lt;code&gt;{?p name &amp;quot;Johnny&amp;quot; . ?p geekCode ?g . ?p emailHash ?h }&lt;/code&gt; does not necessarily denote a colocated join if many servers serve items of the vocabulary.&lt;/p&gt;
&lt;p&gt;However, in most practical cases where a rapid answer is wanted, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes are assumed to be colocated with names. This will save a lot of message passing and offer decent, if not theoretically total, recall. For search-style applications, starting with such assumptions makes sense. If nothing is found, we can then partition each join step separately, to cover the unlikely case of a server that gives geek codes but not names.&lt;/p&gt;
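&lt;p&gt;The effect of such a declaration on query splitting can be sketched in a few lines of Python. This is a toy model, not DARQ or Virtuoso code: the &lt;code&gt;COLOCATED&lt;/code&gt; set and the &lt;code&gt;fragments&lt;/code&gt; function are hypothetical, and triple patterns are plain tuples.&lt;/p&gt;

```python
# Hypothetical declaration: predicates assumed to be served together.
COLOCATED = {"name", "geekCode", "emailHash"}


def fragments(patterns, assume_colocated=True):
    """Group (subject, predicate, object) patterns into per-request fragments.

    With the colocation assumption, patterns that share a subject and
    use declared predicates travel together as one fragment. Without
    it, every pattern becomes its own fragment.
    """
    if not assume_colocated:
        return [[p] for p in patterns]
    groups = {}
    singles = []
    for pattern in patterns:
        subject, predicate, _ = pattern
        if predicate in COLOCATED:
            groups.setdefault(subject, []).append(pattern)
        else:
            singles.append([pattern])
    return list(groups.values()) + singles


query = [
    ("?p", "name", '"Johnny"'),
    ("?p", "geekCode", "?g"),
    ("?p", "emailHash", "?h"),
]
fast = fragments(query)                              # one colocated fragment
thorough = fragments(query, assume_colocated=False)  # one fragment per pattern
print(len(fast), len(thorough))
```

&lt;p&gt;A federation engine following the strategy above would try the single fragment first and fall back to the per-pattern split only if nothing is found.&lt;/p&gt;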
&lt;p&gt;For Virtuoso, we find that a federated query&amp;#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.&lt;/p&gt;
&lt;p&gt;For description, we would take DARQ&amp;#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.&lt;/p&gt;
&lt;p&gt;We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&amp;#39;ll see.&lt;/p&gt;
&lt;p&gt;Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x1dddce48&quot;&gt;Sponger&lt;/a&gt;. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.&lt;/p&gt;
&lt;p&gt;The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-id1109a4c8&quot;&gt;a separate post&lt;/a&gt; about this.&lt;/p&gt;</description></item><item><title>Aspects of RDF to RDF Mapping</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1375</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1375#comments</comments><pubDate>Mon, 09 Jun 2008 13:52:20 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-06-11T13:15:19.000010-04:00</n0:modified><description>&lt;p&gt;The W3C has recently launched an &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-idd763f48&quot;&gt;incubator group about mapping relational data to RDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;From participating in the group&amp;#39;s first few sessions, I get the following impressions.&lt;/p&gt;
&lt;p&gt;There is a segment of users, for example from the biomedical community, who do heavy-duty &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1b388bf0&quot;&gt;data&lt;/a&gt; integration and look to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1a24b198&quot;&gt;RDF&lt;/a&gt; for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity are the points of interest.&lt;/p&gt;
&lt;p&gt;There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as &lt;a href=&quot;http://triplify.org/&quot; id=&quot;link-id0x16cb5c48&quot;&gt;Triplify&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1adcd2b8&quot;&gt;semantic web&lt;/a&gt; search such as &lt;a href=&quot;http://sindice.org/&quot; id=&quot;link-id0x1a462ee0&quot;&gt;Sindice&lt;/a&gt;. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.&lt;/p&gt;
&lt;p&gt;(Some cleanup is taking place with &lt;a href=&quot;http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/&quot; id=&quot;link-id0x17faa940&quot;&gt;Yago&lt;/a&gt; and &lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id0x1acd23f0&quot;&gt;Zitgist&lt;/a&gt;, but this is a matter for a different post.)&lt;/p&gt;
&lt;p&gt;Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.&lt;/p&gt;
&lt;p&gt;As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.&lt;/p&gt;
&lt;p&gt;Where do we stand?&lt;/p&gt;
&lt;p&gt;We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x17f28d60&quot;&gt;RDBMS&lt;/a&gt;(s) on demand. Projects in this direction are ongoing.&lt;/p&gt;
&lt;p&gt;We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.&lt;/p&gt;
&lt;p&gt;These things are somewhat different.&lt;/p&gt;
&lt;p&gt;At present, both the research warehousers and the web scalers predominantly go for ETL.&lt;/p&gt;
&lt;p&gt;This is fine by us as we definitely are in the large RDF store race.&lt;/p&gt;
&lt;p&gt;Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1a2c81c8&quot;&gt;SPARQL&lt;/a&gt;, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.&lt;/p&gt;
&lt;p&gt;To cite an example, the Ordnance Survey of the UK has a GIS system running on &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x18a82010&quot;&gt;Oracle&lt;/a&gt; with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. &amp;quot;Such a big server farm that we&amp;#39;d have to put it on our map,&amp;quot; as Jenny Harding put it at &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x16533418&quot;&gt;ESWC2008&lt;/a&gt;. I&amp;#39;d add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.&lt;/p&gt;
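&lt;p&gt;The machine count follows directly from the two figures quoted above, as this small Python check shows. Nothing here goes beyond the numbers in the text; the variable names are just labels.&lt;/p&gt;

```python
TOTAL_TRIPLES = 10 ** 15        # 1 petatriple, as quoted by Ordnance Survey
TRIPLES_PER_PC = 10 * 10 ** 9   # 10 gigatriples per PC, as assumed in the text

# Integer division: how many PCs are needed to hold every triple.
pcs_needed = TOTAL_TRIPLES // TRIPLES_PER_PC
print(pcs_needed)
```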
&lt;p&gt;So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query.&lt;/p&gt;
&lt;p&gt;You can download the slide set I used for the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x16c57ed0&quot;&gt;Virtuoso&lt;/a&gt; presentation for the RDB to RDF mapping incubator group (&lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt&quot; id=&quot;link-id106f9e88&quot;&gt;PPT&lt;/a&gt;; &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations&quot; id=&quot;link-id10a8dc90&quot;&gt;other formats&lt;/a&gt; coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1b132910&quot;&gt;information&lt;/a&gt; architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, this is not so. The benefits are clear if one can navigate between disjoint silos, but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble but non-trivial question.&lt;/p&gt;
&lt;p&gt;We will show the basics of this with the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x17fc7b58&quot;&gt;TPC-H&lt;/a&gt; mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings between keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (&lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/RDFAndMapped_BI.pdf&quot; id=&quot;link-id12ea4b18&quot;&gt;PDF&lt;/a&gt;; &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations&quot; id=&quot;link-id106e1810&quot;&gt;other formats&lt;/a&gt; coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x18439b70&quot;&gt;SQL&lt;/a&gt; that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an &lt;code&gt;IN&lt;/code&gt; sub-query instead of the equivalent existence test.&lt;/p&gt;
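&lt;p&gt;To illustrate the two equivalent formulations mentioned at the end of the paragraph, here is a small Python sketch that runs both against an in-memory SQLite database. SQLite is only a stand-in to show that the forms return the same rows; it says nothing about how Oracle or any other RDBMS plans them. The two cut-down tables borrow TPC-H column names, and the sample rows are invented.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two cut-down tables with TPC-H style column names and made-up rows.
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Customers with at least one order, written as an existence test.
with_exists = """
    SELECT c_custkey FROM customer c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.o_custkey = c.c_custkey)
    ORDER BY c_custkey
"""

# The same question written with an IN sub-query.
with_in = """
    SELECT c_custkey FROM customer
    WHERE c_custkey IN (SELECT o_custkey FROM orders)
    ORDER BY c_custkey
"""

exists_rows = conn.execute(with_exists).fetchall()
in_rows = conn.execute(with_in).fetchall()
print(exists_rows, in_rows)
```

&lt;p&gt;A mapping layer that knows its target can emit whichever of the two forms that target handles better.&lt;/p&gt;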
&lt;p&gt;Follow this &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x16c29ea0&quot;&gt;blog&lt;/a&gt; for more on the topic; published papers are always a limited view on the matter.&lt;/p&gt;</description></item><item><title>ESWC 2008</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2008-06-09#1374</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1374#comments</comments><pubDate>Mon, 09 Jun 2008 13:49:15 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-06-11T13:15:11.000008-04:00</n0:modified><description>&lt;p&gt;Yrjänä Rankka and I attended &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id10b7a038&quot;&gt;ESWC2008&lt;/a&gt; on behalf of OpenLink.&lt;/p&gt;
&lt;p&gt;We were invited at the last minute to give a &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id105df758&quot;&gt;Linked Open Data&lt;/a&gt; talk at Paolo Bouquet&amp;#39;s Identity and Reference workshop. We also had a demo of &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id12eacca0&quot;&gt;SPARQL&lt;/a&gt; BI (&lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/ESWC2008%20SPARQL%20BI%20OpenLink.ppt&quot; id=&quot;link-id10b43e58&quot;&gt;PPT&lt;/a&gt;; &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations&quot; id=&quot;link-id1116d8f0&quot;&gt;other formats coming soon&lt;/a&gt;), our business intelligence extensions to &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1843a368&quot;&gt;SPARQL&lt;/a&gt;, as well as of joining between relational &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id10badc40&quot;&gt;data&lt;/a&gt; mapped to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id108edaf8&quot;&gt;RDF&lt;/a&gt; and native &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1843a3b0&quot;&gt;RDF&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1843a3c8&quot;&gt;data&lt;/a&gt;. I also spoke at the social networks panel chaired by Harry Halpin.&lt;/p&gt;
&lt;p&gt;I have gathered a few impressions that I will share in the next few posts (&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1375&quot; id=&quot;link-id107298e0&quot;&gt;1 - RDF Mapping&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1376&quot; id=&quot;link-id10b3a530&quot;&gt;2 - DARQ&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-id107290e0&quot;&gt;3 - voiD&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1378&quot; id=&quot;link-id1071a950&quot;&gt;4 - Paradigmata&lt;/a&gt;). &lt;i&gt;Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.&lt;/i&gt;
&lt;/p&gt;
&lt;h2&gt;The voiD Graph&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x16c781e0&quot;&gt;Linked Open Data&lt;/a&gt; was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id108f3c48&quot;&gt;LOD&lt;/a&gt; this time around is &lt;a href=&quot;http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08&quot; id=&quot;link-id10df9830&quot;&gt;voiD, the Vocabulary of Interlinked Datasets&lt;/a&gt;, a means to describe what in fact is inside the &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x16c78228&quot;&gt;LOD&lt;/a&gt; cloud, how to join it with what and so forth. Big time important if there is to be a &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-iddf74578&quot;&gt;web of federatable data sources&lt;/a&gt;, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of &lt;a href=&quot;http://www.linkeddataplanet.com/&quot; id=&quot;link-id10dd0848&quot;&gt;Linked Data Planet&lt;/a&gt; in a couple of weeks.&lt;/p&gt;
&lt;h2&gt;Federating&lt;/h2&gt;
&lt;p&gt;Bastian Quilitz gave a talk about his &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id108746e8&quot;&gt;DARQ&lt;/a&gt;, a federated version of Jena&amp;#39;s ARQ.&lt;/p&gt;
&lt;p&gt;Something like &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id0x16c782e8&quot;&gt;DARQ&lt;/a&gt;&amp;#39;s optimization statistics should make their way into the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id10992348&quot;&gt;SPARQL protocol&lt;/a&gt; as well as the voiD data set description.&lt;/p&gt;
&lt;p&gt;We really need federation but more on this in &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1376&quot; id=&quot;link-id1059d688&quot;&gt;a separate post&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
&lt;a href=&quot;http://xsparql.deri.ie/&quot; id=&quot;link-id10314308&quot;&gt;XSPARQL&lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Axel Polleres et al had a paper about &lt;a href=&quot;http://xsparql.deri.ie/&quot; id=&quot;link-id0x1a2d8458&quot;&gt;XSPARQL&lt;/a&gt;, a merge of &lt;a href=&quot;http://dbpedia.org/resource/XQuery&quot; id=&quot;link-id10b98e90&quot;&gt;XQuery&lt;/a&gt; and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x1d99c1d0&quot;&gt;SPARQL protocol&lt;/a&gt; for this to be practical. We could do this. We&amp;#39;ll have to see when.&lt;/p&gt;
&lt;p&gt;Politically, using &lt;a href=&quot;http://dbpedia.org/resource/XQuery&quot; id=&quot;link-id0x1acae1f0&quot;&gt;XQuery&lt;/a&gt; to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.&lt;/p&gt;
&lt;h2&gt;Panel: Will the Sem Web Rise to the Challenge of the Social Web?&lt;/h2&gt;
&lt;p&gt;The social web panel presented the question of whether the sem web was ready for prime time with data portability.&lt;/p&gt;
&lt;p&gt;The main thrust was expressed in Harry Halpin&amp;#39;s rousing closing words: &amp;quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id122f4da0&quot;&gt;semantic web&lt;/a&gt;, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&amp;quot;&lt;/p&gt;
&lt;p&gt;So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as &lt;a href=&quot;http://dbpedia.org/page/OAuth&quot; id=&quot;link-idf300bc0&quot;&gt;OAuth&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/page/OpenID&quot; id=&quot;link-id10ce7a40&quot;&gt;OpenID&lt;/a&gt;. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &amp;quot;outdated pace&amp;quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.&lt;/p&gt;
&lt;p&gt;I said that triples and vocabularies existed, that OpenLink had &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id1210dbf8&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id11076be8&quot;&gt;OpenLink Data Spaces&lt;/a&gt;, &lt;a href=&quot;http://community.linkeddata.org/&quot; id=&quot;link-id10d46710&quot;&gt;Community LinkedData&lt;/a&gt;) for managing one&amp;#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&amp;#39;s prices, without replication. Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time.&lt;/p&gt;
&lt;p&gt;The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.&lt;/p&gt;
&lt;p&gt;So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!&lt;/p&gt;
&lt;p&gt;As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id108b7688&quot;&gt;information&lt;/a&gt; overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.&lt;/p&gt;
&lt;h2&gt;Identity and Reference&lt;/h2&gt;
&lt;p&gt;For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by &lt;i&gt;nominibus suis&lt;/i&gt;, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.&lt;/p&gt;
&lt;p&gt;For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.&lt;/p&gt;
&lt;p&gt;Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an &lt;a href=&quot;http://dbpedia.org/resource/Open_world_assumption&quot; id=&quot;link-id10c4dfd0&quot;&gt;open world&lt;/a&gt;. For an app-specific warehouse, sameAs can be resolved at load time.&lt;/p&gt;
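&lt;p&gt;To make the point concrete, here is a minimal Python sketch of applying sameAs at query time rather than materializing it. Everything in it is an assumption made for illustration: the ex: URIs, the toy triples and the function names are hypothetical, and this is not how any particular store implements it. The caller passes in the sameAs pairs it trusts for this query, and the stored data is left untouched.&lt;/p&gt;

```python
from collections import defaultdict


def same_as_closure(pairs):
    """Group URIs into equivalence classes from (a, b) sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = defaultdict(set)
    for uri in list(parent):
        classes[find(uri)].add(uri)
    # Map every URI to the full set of URIs it is considered the same as.
    return {uri: members for members in classes.values() for uri in members}


def objects_of(triples, subject, predicate, same_as_pairs=()):
    """Answer (subject, predicate, ?o), expanding the subject through the
    sameAs pairs chosen for this query. Nothing is materialized."""
    aliases = same_as_closure(same_as_pairs).get(subject, {subject})
    return sorted(o for s, p, o in triples if s in aliases and p == predicate)


# Hypothetical data: two URIs that a cooking application treats as one thing.
TRIPLES = [
    ("ex:table_salt", "ex:usedIn", "ex:soup"),
    ("ex:sodium_chloride", "ex:molarMass", "58.44"),
]
COOKING_SAME_AS = [("ex:table_salt", "ex:sodium_chloride")]

# With the cooking sameAs set in effect, the molar mass is found via the alias.
cooking_view = objects_of(TRIPLES, "ex:table_salt", "ex:molarMass", COOKING_SAME_AS)
# Without any sameAs in effect, the same query finds nothing.
chemistry_view = objects_of(TRIPLES, "ex:table_salt", "ex:molarMass")
```

&lt;p&gt;Because the sameAs set is a per-query argument, a spam or malicious assertion affects only the queries that opt into it, and dropping it requires no reloading of data.&lt;/p&gt;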
&lt;p&gt;There was naturally some apparent tension between the Occam camp of &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id105fd240&quot;&gt;entity&lt;/a&gt; name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.&lt;/p&gt;
&lt;h2&gt;CYC and NLP&lt;/h2&gt;
&lt;p&gt;The semantic search workshop was interesting, especially CYC&amp;#39;s presentation. CYC is, as it were, the grand old man of &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id10568158&quot;&gt;knowledge&lt;/a&gt; representation. Over the long term, I would like to have support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x17f7dd40&quot;&gt;knowledge&lt;/a&gt; base for helping in search type queries. If it is for transactions or financial reporting, then queries will be &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id130a0a80&quot;&gt;SQL&lt;/a&gt; and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&amp;#39;ll see.&lt;/p&gt;
&lt;p&gt;I will also have a closer look at natural language one of these days, quite inevitably, since &lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id10795828&quot;&gt;Zitgist&lt;/a&gt; (for example) is into &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x1a2c8bd0&quot;&gt;entity&lt;/a&gt; disambiguation.&lt;/p&gt;
&lt;h2&gt;Scale&lt;/h2&gt;
&lt;p&gt;Garlic gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.&lt;/p&gt;
&lt;p&gt;Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and &lt;a href=&quot;http://sindice.org/&quot; id=&quot;link-id0x1b383720&quot;&gt;Sindice&lt;/a&gt;/&lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id0x1b383738&quot;&gt;Zitgist&lt;/a&gt; as triples.&lt;/p&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.&lt;/p&gt;
&lt;p&gt;For the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id1026ac28&quot;&gt;Virtuoso&lt;/a&gt; roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At &lt;a href=&quot;http://www.linkeddataplanet.com/&quot; id=&quot;link-id0x1a2c7eb0&quot;&gt;Linked Data Planet&lt;/a&gt;, we will make more announcements.&lt;/p&gt;</description></item><item><title>RDF Benchmarking, Role, Motives, and Rationale</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2007-11-21#1274</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1274#comments</comments><pubDate>Wed, 21 Nov 2007 14:19:39 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-30T14:28:17.000001-04:00</n0:modified><description>&lt;p&gt;Arising from the recent W3C workshop on &lt;a href=&quot;http://www.w3.org/2007/03/RdfRDB/&quot; id=&quot;link-id10679c70&quot;&gt;mapping relational data to RDF&lt;/a&gt;, there is some discussion on &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1268&quot; id=&quot;link-id1258dca8&quot;&gt;starting a benchmarking oriented experimental group&lt;/a&gt; under the W3C. I&amp;#39;ll here make some comments on where this might fit and how this might serve our nascent industry.&lt;/p&gt;
&lt;p&gt;To the public, basically any recipient of the semantic &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa203a350&quot;&gt;data&lt;/a&gt; web message, the benchmarking activity should communicate:&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;The semantic data web claims to&lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt; allow integrating any legacy data from wherever and allow translating this into common, mutually joinable vocabularies, and&lt;/li&gt;
&lt;li&gt;make the web into a big database capable of answering structured queries on any open data.&lt;/li&gt;
  &lt;/ol&gt;
 &lt;/li&gt;
&lt;li&gt;
  &lt;p&gt;The benchmarking activity is to prove that this is not a pipe dream that Gartner Group forecast for 2027. Instead, there exists &lt;/p&gt;
&lt;ol&gt;
    &lt;li&gt;an industry, &lt;/li&gt;
&lt;li&gt;a degree of consensus within the industry concerning what the semantic data web is for, and&lt;/li&gt;
&lt;li&gt;products that are beyond experimental and can deliver at least some of the claimed benefits of the semantic data web.&lt;/li&gt;
  &lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To the general public, the message will be best delivered by the existence of online services that do interesting things with &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1404e708&quot;&gt;linked data&lt;/a&gt;, starting from search and going to more specialized derivative products of structured &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0xa1e12cd8&quot;&gt;information&lt;/a&gt; on the web.&lt;/p&gt;
&lt;p&gt;To those intending to apply some semantic data web things themselves, the benchmark activity should give a directory of products to look at. The reason why a benchmark suite backed by some industry consortium is useful is that it adds to the end user&amp;#39;s confidence that the use case being measured is of somewhat general relevance and not just made to demonstrate any single product&amp;#39;s strengths. Besides this, the TPC idea of disclosing scale, throughput, price per throughput and date is fine because it makes for easy tabulation of results. The intricacies of the full disclosure are effectively masked, and it is my guess that very few read the actual full disclosures.&lt;/p&gt;
&lt;p&gt;The inference that an evaluator draws from benchmark results is that some product figuring there consistently is somewhat serious and can be studied further. Being in the running is like a stamp of approval. The benchmarks are complex and the evaluator seldom goes to the trouble of really analyzing performance by individual query or transaction even if these are and must be given. It is a bit like Formula 1: viewers do not generally read the rules on car engines or aerodynamics, let alone understand their finer points.&lt;/p&gt;
&lt;p&gt;For credibility to be thus given to products and hence the industry, we should just have a couple of well defined and agreed upon benchmarks, just like TPC.&lt;/p&gt;
&lt;p&gt;The third public is the developer. As a DBMS developer, I am a great fan of TPC. The great benefit I derive from their work is that they give a test suite for measuring effects of code changes on performance. Also, assuming that the TPC workload mix is representative, it allows ranking optimizations by importance. Lastly, TPC gives a great way of describing results, e.g., changes resulting in x% improvement on throughput of y. In such usage, the benchmarks are pretty much never run by the rules but results obtained are still good for internal comparison.&lt;/p&gt;
&lt;p&gt;Communication about IS should allow for short, simple messages: Release XX Halves Price per Throughput.&lt;/p&gt;
&lt;p&gt;The existence of benchmarks is, if not absolutely necessary, then at least a great help for such communication. Besides, people are culturally used to all kinds of racing and sports results so this is even a familiar format.&lt;/p&gt;
&lt;p&gt;Now the TPC is also not perfect. In the high end, the measured configurations are so large that one does not see them very often in practice. It is like the techno sports of Formula 1 or America&amp;#39;s Cup. Interesting for the curiosity value but not immediately relevant to the regular car buyer or weekend yachtsman. Further, sponsoring a by-the-book audited TPC result is not so simple. Not as expensive as putting out an America&amp;#39;s Cup challenge but still some trouble and expense.&lt;/p&gt;
&lt;p&gt;So, for us to benefit by the benchmarking activity, we must find a group that can both agree and be somewhat representative. Then we must put out a simple message: This here is for integration of relational sources and this here for storage and query of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xa192c590&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Furthermore, insofar as we draw on relational or similar sources, the technology should not do less than the established alternative. Doing less would send the wrong message.&lt;/p&gt;
&lt;p&gt;Entering the running should not be overly difficult for vendors, hence we should not have too many benchmarks and the ones that there are should be representative and sufficiently varied workloads. The results should be compact and easy to state. One more reason why I like TPC&amp;#39;s work is the fact that the benchmarks have an easy to understand, unified use case behind them. Approximately what is done in each becomes clear from a very short and succinct description even though the details can be complex. I suspect this is one side of their appeal. I would venture the guess that a single use case story is easier to sell than a composite metric of disparate tests. Also in the scientific computing world, we have use cases, like NAS for aerodynamics, so having a use case story is quite common and a factor for making a benchmark&amp;#39;s relevance understandable.&lt;/p&gt;
&lt;p&gt;Is this all possible?&lt;/p&gt;
&lt;p&gt;To play the devil&amp;#39;s advocate, I could say that the use cases are not as well settled as the relational ones, hence formulating a generally representative benchmark is not possible. Now this is certainly not a message that this community wishes to send. Besides, there exists decades worth of history of the problems of information integration and a great deal of RDF data out there, even a compilation of dozens of industry use cases by the SWEO, so we are not exactly in the dark here.&lt;/p&gt;
&lt;p&gt;Can there be political agreement in reasonable time? If we look at the TPC as a precedent, judging by the rate of publication and revision, the process is not exactly quick. Now, for the TPC, it does not have to be. Judging by the frequency of published test results, hardware vendors are happy enough to have a forum to show off and do so at every turn.&lt;/p&gt;
&lt;p&gt;Now we are not at this stage of maturity yet.&lt;/p&gt;
&lt;p&gt;Composing a TPC style test spec is possible in a reasonable time for an individual but likely not for a committee. It is quite voluminous but also quite formulaic. While TPC&amp;#39;s material is their own, I see no reason that we could not reference or link to it where applicable.&lt;/p&gt;
&lt;p&gt;Who would be motivated by such activity? How to pitch the activity to would-be participants? I don&amp;#39;t think that just talking about what to measure and how is interesting enough. This is covered ground. Vendors want to promote themselves and end users want to have vendors compete at solving their problems. Or so it would be in a simpler world.&lt;/p&gt;
&lt;p&gt;Personally, I&amp;#39;d like to see a benchmark with a use case story people can relate to emerge in the next few months. Now I am not necessarily holding my breath waiting for this. For purposes of ongoing development, there is real data out there, and we can, for example, run on it the social web workload mix I suggested a couple of &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x1f671bf8&quot;&gt;blog&lt;/a&gt; posts back; that is good enough for us. But it is not good enough for the industry&amp;#39;s messaging.&lt;/p&gt;
&lt;p&gt;I&amp;#39;d say that we have to assume that people play in good faith and simply ask who wants to run and get an extra edge by being in on the design of the race track. By good faith I here mean a sincere wish to have the race take place in the first place.&lt;/p&gt;
&lt;p&gt;The sport is exciting for the players and spectators alike if there is a use case story that they can relate to and an actual tournament. So this is what we should aim for. Because this is so far a niche public, we should not fragment the activity too much and we should consider how understandable and relevant the benchmark activity is to likely semantic data web adopters.&lt;/p&gt; </description></item><item><title>Virtuoso and cluster capacity allocation</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2007-08-28#1246</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=1246#comments</comments><pubDate>Tue, 28 Aug 2007 10:08:25 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-25T12:39:16.000002-04:00</n0:modified><description>&lt;p&gt;I just read &lt;a href=&quot;http://labs.google.com/papers/bigtable.html&quot; id=&quot;link-id10967a78&quot;&gt;Google&amp;#39;s Bigtable&lt;/a&gt; paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines.&lt;/p&gt;
&lt;p&gt;I have talked about partitioning versus distributed cache in the &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1229&quot; id=&quot;link-id10913318&quot;&gt;second to last post&lt;/a&gt;. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be.&lt;/p&gt;
&lt;p&gt;So &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x1c4caaa0&quot;&gt;Oracle&lt;/a&gt; has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition.&lt;/p&gt;
&lt;p&gt;I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa1b52ab8&quot;&gt;data&lt;/a&gt; and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too.&lt;/p&gt;
&lt;p&gt;So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course.&lt;/p&gt;
&lt;p&gt;Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. The tree top must be common &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x9f94a5f8&quot;&gt;knowledge&lt;/a&gt;, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time.&lt;/p&gt;
&lt;p&gt;So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found.&lt;/p&gt;
&lt;p&gt;The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server&amp;#39;s file group gets too big, it splits. One half of each key&amp;#39;s partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1a3e17a0&quot;&gt;information&lt;/a&gt; on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there.&lt;/p&gt;
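&lt;p&gt;As a rough illustration of the splitting step, here is a minimal Python sketch. It assumes that each file group owns a contiguous range of hash values and that a split simply halves that range; the class, the function names and the naming of the new group are hypothetical and do not describe actual Virtuoso code. Removing the rows that no longer belong in each half is left out.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class FileGroup:
    name: str
    lo: int      # inclusive start of the hash range held here
    hi: int      # exclusive end of the hash range held here
    server: str  # server process currently managing the group


def split(groups, name):
    """Halve the hash range of one file group. The lower half stays in the
    original group, the upper half goes to a new group on the same server.
    Returns the updated partition map that every server would receive."""
    result = []
    for g in groups:
        if g.name != name:
            result.append(g)
            continue
        mid = (g.lo + g.hi) // 2
        result.append(FileGroup(g.name, g.lo, mid, g.server))
        result.append(FileGroup(g.name + "b", mid, g.hi, g.server))
    return result


def locate(groups, hash_value):
    """Return the file group whose range contains hash_value, or None."""
    for g in groups:
        if hash_value in range(g.lo, g.hi):
            return g
    return None


# Hypothetical starting layout: two machines, one file group each.
before = [
    FileGroup("fg1", 0, 1024, "host1"),
    FileGroup("fg2", 1024, 2048, "host2"),
]
after = split(before, "fg1")
```

&lt;p&gt;After the split, requests are addressed to a file group and not to a host, so moving the new group to another machine later only means changing its server field in the map.&lt;/p&gt;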
&lt;p&gt;If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically.  &lt;/p&gt;
&lt;p&gt;If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines.&lt;/p&gt;
&lt;p&gt;So what happens to disk locality when file groups split? Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc.  &lt;/p&gt;
&lt;p&gt;Now let us suppose the first partition, the even K&amp;#39;s, splits. It could split so that multiples of 4 go one way and the rest another way. Now we&amp;#39;d have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic; remember that these are controlled by the same server process.&lt;/p&gt;
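&lt;p&gt;The integer key example can be written out in a few lines of Python. This is only a sketch of the arithmetic described above; the constant and function names, and the numbering of the partitions as 0, 1 and 2, are made up for illustration.&lt;/p&gt;

```python
CHUNK_SHIFT = 10  # the low 10 bits are ignored, so 1024 consecutive keys stay together
CHUNK_BITS = 11   # bits 10-20 inclusive select the chunk


def chunk(key):
    """Extract bits 10-20 of an integer key."""
    return (key // 2 ** CHUNK_SHIFT) % (2 ** CHUNK_BITS)


def partition_before_split(key):
    """Two-way partitioning: even chunks go to partition 0, odd chunks to 1."""
    return chunk(key) % 2


def partition_after_split(key):
    """Partition 0 has split. Chunks that are multiples of 4 stay in
    partition 0, the other even chunks move to the new partition 2.
    Odd chunks are unaffected and stay in partition 1."""
    c = chunk(key)
    if c % 2 == 1:
        return 1
    return 0 if c % 4 == 0 else 2


# Where the first key of each 1K run lands, before and after the split.
layout_before = [partition_before_split(k * 1024) for k in range(6)]
layout_after = [partition_after_split(k * 1024) for k in range(6)]
```

&lt;p&gt;Keys within one run of 1024 always share a partition, which is what keeps consecutive values on the same page and leaves key compression intact.&lt;/p&gt;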
&lt;p&gt;For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change.  &lt;/p&gt;
&lt;p&gt;Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference.&lt;/p&gt;
&lt;p&gt;For purposes of transactions, we might view a file group as deserving its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive.&lt;/p&gt;
&lt;p&gt;Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs.&lt;/p&gt;
&lt;p&gt;Implementing this on the current &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1a35c638&quot;&gt;Virtuoso&lt;/a&gt; is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal.&lt;/p&gt;
&lt;p&gt;Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.&lt;/p&gt;</description></item><item><title>Virtuoso and Database Scalability</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2006-04-24#961</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=961#comments</comments><pubDate>Mon, 24 Apr 2006 15:27:23 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:18.000003-04:00</n0:modified><description>&lt;p&gt;We have a new &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VOSScale&quot; id=&quot;link-id1068c3f8&quot;&gt;technical article&lt;/a&gt;, benchmarking &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xd32b4d8&quot;&gt;Virtuoso&lt;/a&gt; on different hardware configurations.&lt;/p&gt;
&lt;p&gt;This is useful reading for anyone interested in using Virtuoso as a database back end for online applications or simply anyone interested in relational database scalability, no matter what specific DBMS.&lt;/p&gt;
&lt;p&gt;We use an adaptation of the well known &lt;a href=&quot;http://dbpedia.org/resource/TPC-C&quot; id=&quot;link-id0xdae8e30&quot;&gt;TPC-C&lt;/a&gt; benchmark to see what hardware configuration will give the best price/performance. We also explain how to tune Virtuoso and how and why different parameters affect the throughput.&lt;/p&gt;
</description></item><item><title>New Article on XML, Full Text and Smart Alerts</title><guid>http://www.openlinksw.com/weblog/oerling/?date=2006-04-17#958</guid><comments>http://www.openlinksw.com/weblog/oerling/?id=958#comments</comments><pubDate>Mon, 17 Apr 2006 17:07:53 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:15-04:00</n0:modified><description>&lt;p&gt;There is a new article, &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VOSArtText&quot; id=&quot;link-id101ebda0&quot;&gt;XML and Full Text Indexing and Filtering in Virtuoso&lt;/a&gt;, on the &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main&quot; id=&quot;link-id105e7248&quot;&gt;Virtuoso Open Source Edition&lt;/a&gt; wiki. &lt;/p&gt;
&lt;p&gt;The article shows how to harvest ATOM feeds, search them, and register alerts that fire when a stored search condition is met by incoming &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x18b85f38&quot;&gt;data&lt;/a&gt;. This lets the new data index the stored queries and not the other way around. This is the first in a series of hands on technical articles on &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1646ca88&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt;</description></item>
</channel>
</rss>
