<?xml version="1.0" encoding="UTF-8" ?>
<!--ATOM based XML document generated By OpenLink Virtuoso-->
<atom:feed xmlns:atom="http://www.w3.org/2005/Atom" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:vi="http://www.openlinksw.com/weblog/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:itunes="http://www.itunes.com/DTDs/Podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/">
<atom:id>http://www.openlinksw.com/weblog/oerling/</atom:id>
<atom:title>Orri Erling&#39;s Weblog</atom:title>
<atom:link href="http://www.openlinksw.com/weblog/oerling/" type="text/html" rel="alternate" />
<atom:link href="http://www.openlinksw.com/weblog/oerling/gems/atom_tag_arch.xml?:tag=benchmarking&amp;:bid=oerling-blog-0" type="application/atom+xml" rel="self" />
 <atom:author>
  <atom:name>oerling@openlinksw.com</atom:name>
  <atom:email>oerling@openlinksw.com</atom:email>
  </atom:author>
<atom:updated>2009-11-23T10:51:19Z</atom:updated>
<atom:generator>Virtuoso Universal Server 05.12.3041</atom:generator>
<atom:logo>http://www.openlinksw.com/weblog/public/images/vbloglogo.gif</atom:logo>
 <atom:entry>
  <atom:title>European Commission and the Data Overflow</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2009-10-27#1585</atom:id>
  <atom:published>2009-10-27T18:29:51Z</atom:published>
  <atom:updated>2009-10-27T14:57:28.000002-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x79cfe58&quot;&gt;data&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Since the &lt;a href=&quot;http://cordis.europa.eu/fp7/ict/content-knowledge/consultation_en.html&quot; id=&quot;link-id1191c0f8&quot;&gt;questionnaire is public&lt;/a&gt;, I am publishing my answers below.&lt;/p&gt; &lt;ol type=&quot;1&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Data and data types&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? &lt;/b&gt; &lt;/p&gt; &lt;p&gt;Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x7d7e7a0&quot;&gt;RDF&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x7f2a788&quot;&gt;linked data&lt;/a&gt; principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x7dfbca8&quot;&gt;DBpedia&lt;/a&gt; identifiers for real-world entities, e.g., most things that would be in the news.&lt;/p&gt; &lt;p&gt;This also means that internal data processes and silos may be enriched with this content. 
There is consequent pressure for accommodating more diversity of data, with more flexible &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x7babaf8&quot;&gt;schema&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.&lt;/p&gt; &lt;p&gt;The volume of the well-known linked data sets is around 10 billion statements. With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is especially so if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.&lt;/p&gt; &lt;p&gt;Relevant sections of this mass of data are a potential addition to any present or future analytics application.&lt;/p&gt; &lt;p&gt;Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.&lt;/p&gt; &lt;p&gt;Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. 
Data and &lt;a href=&quot;http://dbpedia.org/resource/Metadata&quot; id=&quot;link-id0x96ce60&quot;&gt;metadata&lt;/a&gt; should travel together but may have different data models.&lt;/p&gt; &lt;p&gt;By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x7e8e248&quot;&gt;knowledge&lt;/a&gt; graphs, 3D, sensor streams...)?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., &lt;i&gt;photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.&lt;/i&gt; &lt;/p&gt; &lt;p&gt;Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.&lt;/p&gt; &lt;p&gt;Interleaving of all database functions and types becomes increasingly important.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Industries, communities&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Who is producing these data and why? Could they do it better? How?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Right now, projects such as &lt;a href=&quot;http://www.bio2rdf.org/&quot; id=&quot;link-id0x43bd098&quot;&gt;Bio2RDF&lt;/a&gt;, &lt;a href=&quot;http://neurocommons.org/page/Main_Page&quot; id=&quot;link-id0x5c074b0&quot;&gt;Neurocommons&lt;/a&gt;, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. 
These processes, along with the &lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot; id=&quot;link-id0x72131d0&quot;&gt;linked data meme&lt;/a&gt; generally taking off, drive demand for better &lt;a href=&quot;http://dbpedia.org/resource/Natural_language_processing&quot; id=&quot;link-id0x71e7798&quot;&gt;NLP&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/Natural_language_processing&quot; id=&quot;link-id0x7e0e2f0&quot;&gt;Natural Language Processing&lt;/a&gt;), e.g., &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x71ab500&quot;&gt;entity&lt;/a&gt; and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).&lt;/p&gt; &lt;p&gt;Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x7a8e9a8&quot;&gt;information&lt;/a&gt; integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x599f630&quot;&gt;URI&lt;/a&gt;. Some of this information may even be published on an &lt;a href=&quot;http://dbpedia.org/resource/Extranet&quot; id=&quot;link-id0x2a28f98&quot;&gt;extranet&lt;/a&gt; for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Who is consuming these data and why? 
Could they do it better? How?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird&amp;#39;s eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x5999558&quot;&gt;SQL&lt;/a&gt; report generators and statistics packages.&lt;/p&gt; &lt;p&gt;Where Web 2.0 made the &lt;i&gt;citizen journalist&lt;/i&gt;, the web of linked data will make the &lt;i&gt;citizen analyst&lt;/i&gt;. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a &amp;quot;meshup economy&amp;quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What industrial sectors in Europe could become more competitive if they became much better at managing data?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media. &lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Is the regulation landscape imposing constraints (privacy, compliance ...) that don&amp;#39;t have today good tool support?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;The regulation landscape drives database demand through data retention requirements and the like.&lt;/p&gt; &lt;p&gt;With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. 
Regulation is needed to protect individuals, but integration should still be possible for science.&lt;/p&gt; &lt;p&gt;For this, we see a need for progress in applying policy-based approaches (e.g., row-level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.&lt;/p&gt; &lt;p&gt;More research is needed for reconciling the need for security with the advantages of broad-based &lt;i&gt;ad hoc&lt;/i&gt; integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&amp;#39;s profile. This is a tall order and implementing something of the sort is an open question.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What are the main practical problems identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.&lt;/b&gt; &lt;/p&gt; &lt;p&gt;We have come across the following:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Knowing that the data exists in the first place.&lt;/li&gt; &lt;li&gt;If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.&lt;/li&gt; &lt;li&gt;Compatible subject matter but incompatible representation: For example, one source has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is a need for a lot of one-off programming to align data.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. 
Computation needs to go to the data, and databases need to support this.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Services, software stacks, protocols, standards, benchmarks&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What combinations of components are needed to deal with these problems?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, &lt;a href=&quot;http://dbpedia.org/resource/Greenplum&quot; id=&quot;link-id0x45ecfa0&quot;&gt;Greenplum&lt;/a&gt;, and OpenLink &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x7f73fc8&quot;&gt;Virtuoso&lt;/a&gt; are examples of DBMSs featuring work in this direction.&lt;/p&gt; &lt;p&gt;Interoperability and at least &lt;i&gt;de facto&lt;/i&gt; standards in ways of doing this will emerge.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What data exchange and processing mechanisms will be needed to work across platforms and programming languages?&lt;/b&gt; &lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x776a1a0&quot;&gt;HTTP&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x2a4e8d0&quot;&gt;XML&lt;/a&gt;, and RDF are in fact very verbose, yet these are the formats and models that have uptake. 
Thus, these will continue to be used even though one might think binary formats to be more efficient.&lt;/p&gt; &lt;p&gt;There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.&lt;/p&gt; &lt;p&gt;For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What data environments are today so wastefully messy that they would benefit from the development of standards?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;RDF and &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x2a35960&quot;&gt;OWL&lt;/a&gt; are not messy but they could use some more performance; we are working on this. &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x12362e8&quot;&gt;SPARQL&lt;/a&gt; is finally acquiring the capabilities of a serious query language, so things are slowly coming together.&lt;/p&gt; &lt;p&gt;Community process for developing application domain specific vocabularies works quite well, even though one could argue it is &lt;i&gt;ad hoc&lt;/i&gt; and not up to what a modeling purist might wish.&lt;/p&gt; &lt;p&gt;Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS* standards, for example.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What kind of performance is expected or required of these systems? Who will measure it reliably? 
How?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Relational databases have a history of substantial investment in &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0x7b2d7c8&quot;&gt;optimization&lt;/a&gt; and some of them are very good for what they do, e.g., the newer generation of analytics databases.&lt;/p&gt; &lt;p&gt;The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.&lt;/p&gt; &lt;p&gt;These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.&lt;/p&gt; &lt;p&gt;We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today&amp;#39;s best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with &lt;a href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x794ee48&quot;&gt;MonetDB&lt;/a&gt; gives clues as to how these aims can be reached.&lt;/p&gt; &lt;p&gt;The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.&lt;/p&gt; &lt;p&gt;Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like &lt;a href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x7d45c58&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x45b0198&quot;&gt;H&lt;/a&gt;. But the TPC-H rules do not allow official reporting of such.&lt;/p&gt; &lt;p&gt;Development of benchmarks for RDF, complex queries, and inference is needed. 
A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-&lt;a href=&quot;http://dbpedia.org/resource/C%2B%2B&quot; id=&quot;link-id0x7e32178&quot;&gt;C&lt;/a&gt; might be the basis, maybe augmented with massive user-generated content like reviews and blogs.&lt;/p&gt; &lt;p&gt;If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high end TPC-C results), then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.&lt;/p&gt; &lt;p&gt;Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.&lt;/p&gt; &lt;p&gt;Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Usability and training&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. 
For the linked data &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x7e01618&quot;&gt;web&lt;/a&gt;, the same will take place behind SPARQL.&lt;/p&gt; &lt;p&gt;Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned.&lt;/p&gt; &lt;p&gt;There is no single solution. For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.&lt;/p&gt; &lt;p&gt;Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.&lt;/p&gt; &lt;p&gt;For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.&lt;/p&gt; &lt;p&gt;For shipping functions in a cluster or cloud, the &lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x43665a8&quot;&gt;BOOM&lt;/a&gt; (&lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x7718f00&quot;&gt;Berkeley Orders Of Magnitude&lt;/a&gt;) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x7d64f68&quot;&gt;PHP&lt;/a&gt; developer can be made to do logic programming.&lt;/p&gt; &lt;p&gt;This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0xbab1ae98&quot;&gt;MySQL&lt;/a&gt;, inconvenient as this may be, for an example. 
There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, &lt;i&gt;post hoc&lt;/i&gt;, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.&lt;/p&gt; &lt;p&gt;One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.&lt;/p&gt; &lt;p&gt;This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. 
This inflection is provided by the &lt;a href=&quot;https://wiki.mozilla.org/Labs/Ubiquity&quot; id=&quot;link-id0x770df38&quot;&gt;ubiquity&lt;/a&gt; of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.&lt;/p&gt; &lt;p&gt;RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, &lt;a href=&quot;http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29&quot; id=&quot;link-id0x55596a8&quot;&gt;LAMP&lt;/a&gt; or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.&lt;/p&gt; &lt;p&gt;A lot of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x3d5378&quot;&gt;semantic web&lt;/a&gt; potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.&lt;/p&gt; &lt;p&gt;For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.&lt;/p&gt; &lt;p&gt;The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.&lt;/p&gt; &lt;p&gt;As the data overflow proceeds, its victims will multiply and create demand for solutions. 
The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.&lt;/p&gt; &lt;p&gt;If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.&lt;/p&gt; &lt;p&gt;The problem is creating the pull.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Challenges&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, &lt;a href=&quot;http://dbpedia.org/resource/Google&quot; id=&quot;link-id0x6a6c2b0&quot;&gt;Google&lt;/a&gt; Lunar X Prize, etc. ... ?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better; reality is consistently more complex and surprising than imagination. 
Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.&lt;/p&gt; &lt;p&gt;Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.&lt;/p&gt; &lt;p&gt;The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.&lt;/p&gt; &lt;p&gt;The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work more closely together. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.&lt;/p&gt; &lt;p&gt;If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What should one do to set up such a challenge, administer, and monitor it?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.&lt;/p&gt; &lt;p&gt;The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&amp;amp;D. 
The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.&lt;/p&gt; &lt;p&gt;There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.&lt;/p&gt; &lt;p&gt;Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;/ol&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>VLDB 2009 TPC Workshop (3 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2009-09-01#1576</atom:id>
  <atom:published>2009-09-01T15:51:09Z</atom:published>
  <atom:updated>2009-09-01T17:32:30-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;Michael &lt;a href=&quot;http://dbpedia.org/resource/Michael_Stonebraker&quot; id=&quot;link-id0x15e5efe0&quot;&gt;Stonebraker&lt;/a&gt; gave the keynote at the &lt;a href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x18cee5f0&quot;&gt;TPC&lt;/a&gt; workshop. His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself. From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention.&lt;/p&gt; &lt;p&gt;Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0x4e55368&quot;&gt;optimization&lt;/a&gt;, as has pretty much everybody else.&lt;/p&gt; &lt;p&gt;It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x191cd880&quot;&gt;data&lt;/a&gt;, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication.&lt;/p&gt; &lt;p&gt;Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when they can. They want arrays for physics, and graphs for biology and chemistry. 
&lt;a href=&quot;http://dbpedia.org/resource/MapReduce&quot; id=&quot;link-id0x53f6040&quot;&gt;MapReduce&lt;/a&gt; is eating the database&amp;#39;s lunch; what will you do about this?&lt;/p&gt; &lt;p&gt;I later suggested incorporating an &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x18902070&quot;&gt;RDF&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Metadata&quot; id=&quot;link-id0x3990af8&quot;&gt;metadata&lt;/a&gt; benchmark into the TPC suite. We&amp;#39;ll see about this; we&amp;#39;ll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.&lt;/p&gt; &lt;p&gt;TPC&amp;#39;s own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. When the solution to this problem becomes commonplace, the benchmark&amp;#39;s relevance gradually drops.&lt;/p&gt; &lt;p&gt;There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones. A contrast of &lt;a href=&quot;http://dbpedia.org/resource/Cache&quot; id=&quot;link-id0x1942aca8&quot;&gt;cache&lt;/a&gt; fusion and partitioning would also have been of interest. We have our own data and experience but we find we don&amp;#39;t have time to measure all the other systems.&lt;/p&gt; &lt;p&gt;Anyway, it is good to raise the question of smooth and predictable performance.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Updated hardware improves LUBM 8000 load rate in Virtuoso 6</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2009-08-14#1568</atom:id>
  <atom:published>2009-08-14T19:01:30Z</atom:published>
  <atom:updated>2009-08-15T15:27:25-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We repeated the &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1562&quot; id=&quot;link-id173d3068&quot;&gt;earlier LUBM 8000 experiment&lt;/a&gt; on a newer machine, with 2 x Xeon 5520 and 72G 1333 MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.&lt;/p&gt; &lt;p&gt;The load rate is now 160,739 triples-per-second on the new machine alone, and 214,188 triples-per-second with the two machines clustered.&lt;/p&gt; &lt;table&gt; &lt;tr&gt; &lt;th&gt;&lt;/th&gt; &lt;td&gt;   &lt;/td&gt; &lt;th align=&quot;center&quot;&gt;&lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x199b9740&quot;&gt;Virtuoso&lt;/a&gt; 6 &lt;br /&gt; (previous run)&lt;/th&gt; &lt;td&gt;   &lt;/td&gt; &lt;th align=&quot;center&quot;&gt;Virtuoso 6 &lt;br /&gt; (new run)&lt;/th&gt; &lt;td&gt;   &lt;/td&gt; &lt;th align=&quot;center&quot;&gt;Virtuoso 6 &lt;br /&gt; (newest run)&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;blades&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;processors&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2 x Xeon 5410&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2 x Xeon 5520&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 2 x Xeon 5520 &lt;br /&gt;+ &lt;br /&gt;2 x Xeon 5410 &lt;br /&gt;with 1x1GigE &lt;br /&gt;interconnect &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;memory&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 16G 667 MHz&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;72G 1333 MHz&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;72G 1333 MHz &lt;br /&gt;+ &lt;br /&gt; 
16G 667 MHz &lt;br /&gt; respectively&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;reported load rate&lt;br /&gt;triples-per-second&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 110,532 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 160,739 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 214,188 &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;p&gt;Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2009-06-29#1562</atom:id>
  <atom:published>2009-06-29T16:12:34Z</atom:published>
  <atom:updated>2009-08-15T16:06:42.000001-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;LUBM load speed still seems to be a metric that is quoted in comparisons of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id142df6e8&quot;&gt;RDF&lt;/a&gt; stores. Consequently, we too measured the load time of LUBM 8000, 1,068 million triples, on the newest &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id1389dfa0&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes), one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.&lt;/p&gt; &lt;p&gt;The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.&lt;/p&gt; &lt;p&gt;The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.&lt;/p&gt; &lt;p&gt;We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso RDF: A Getting Started Guide for the Developer</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-12-17#1504</atom:id>
  <atom:published>2008-12-17T12:31:34Z</atom:published>
  <atom:updated>2008-12-17T12:41:21.000001-05:00</atom:updated>
  <atom:content type="html">&lt;p&gt;It is a long-standing promise of mine to dispel the false impression that using &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot; id=&quot;link-id113506d0&quot;&gt;Virtuoso&lt;/a&gt; to work with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id115d9528&quot;&gt;RDF&lt;/a&gt; is complicated.&lt;/p&gt; &lt;p&gt;The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it. This is done programmatically, with no confusing user interfaces.&lt;/p&gt; &lt;p&gt;You should have a Virtuoso Open Source tree built and installed. We will look at the LUBM benchmark demo that comes with the package. All you need is a Unix shell. Running the shell under Emacs (&lt;code&gt;M-x shell&lt;/code&gt;) is best, although the open source &lt;code&gt;isql&lt;/code&gt; utility should also have command line editing. The Emacs shell is, however, convenient for cutting and pasting things between shell and files.&lt;/p&gt; &lt;p&gt;To get started, cd into &lt;code&gt;binsrc/tests/lubm&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;To verify that this works, you can run &lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;./test_server.sh virtuoso-t&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;This will test the server with the LUBM queries. This should report 45 tests passed. After this we will do the tests step-by-step.&lt;/p&gt; &lt;h2&gt;Loading the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id10f7bd90&quot;&gt;Data&lt;/a&gt; &lt;/h2&gt; &lt;p&gt;The file &lt;code&gt;lubm-load.sql&lt;/code&gt; contains the commands for loading the LUBM single university qualification database.&lt;/p&gt; &lt;p&gt;The data files themselves are in &lt;code&gt;lubm_8000&lt;/code&gt;, 15 files in RDF/XML.&lt;/p&gt; &lt;p&gt;There is also a little ontology called &lt;code&gt;inf.nt&lt;/code&gt;. 
This declares the subclass and subproperty relations used in the benchmark.&lt;/p&gt; &lt;p&gt;So now let&amp;#39;s go through this procedure.&lt;/p&gt; &lt;p&gt;Start the server:&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;$ virtuoso-t -f &amp;amp; &lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;This starts the server in foreground mode, and puts it in the background of the shell.&lt;/p&gt; &lt;p&gt;Now we connect to it with the isql utility.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;$ isql 1111 dba dba &lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;This gives a &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt. The default username and password are both &lt;code&gt;dba&lt;/code&gt;.&lt;/p&gt; &lt;p&gt;When a command is &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1176ce70&quot;&gt;SQL&lt;/a&gt;, it is entered directly. If it is &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id156df468&quot;&gt;SPARQL&lt;/a&gt;, it is prefixed with the keyword &lt;code&gt;sparql&lt;/code&gt;. This is how all the SQL clients work. Any SQL client, such as any &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id152d0a00&quot;&gt;ODBC&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id157ad6a0&quot;&gt;JDBC&lt;/a&gt; application, can use SPARQL if the SQL string starts with this keyword.&lt;/p&gt; &lt;p&gt;The &lt;code&gt;lubm-load.sql&lt;/code&gt; file is quite self-explanatory. 
It begins with defining an SQL procedure that calls the RDF/XML load function, &lt;code&gt;DB..RDF_LOAD_RDFXML&lt;/code&gt;, for each file in a directory.&lt;/p&gt; &lt;p&gt;Next it calls this function for the &lt;code&gt;lubm_8000&lt;/code&gt; directory under the server&amp;#39;s working directory.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;sparql CLEAR GRAPH &amp;lt;lubm&amp;gt;;
sparql CLEAR GRAPH &amp;lt;inf&amp;gt;;
load_lubm ( server_root() || &amp;#39;/lubm_8000/&amp;#39; );
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;Then it verifies that the right number of triples is found in the &amp;lt;lubm&amp;gt; graph.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;sparql SELECT COUNT(*) FROM &amp;lt;lubm&amp;gt; WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed. They can be ignored for now.&lt;/p&gt; &lt;p&gt;Then it adds some implied &lt;code&gt;subOrganizationOf&lt;/code&gt; triples. This is part of setting up the LUBM test database.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;sparql
  PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
  INSERT INTO GRAPH &amp;lt;lubm&amp;gt; { ?x ub:subOrganizationOf ?z }
  FROM &amp;lt;lubm&amp;gt;
  WHERE { ?x ub:subOrganizationOf ?y . ?y ub:subOrganizationOf ?z . };
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;Then it loads the ontology file, &lt;code&gt;inf.nt&lt;/code&gt;, using the Turtle load function, &lt;code&gt;DB.DBA.TTLP&lt;/code&gt;. 
The arguments of the function are the text to load, the default namespace prefix, and the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id15835550&quot;&gt;URI&lt;/a&gt; of the target graph.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;DB.DBA.TTLP ( file_to_string ( &amp;#39;inf.nt&amp;#39; ), &amp;#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&amp;#39;, &amp;#39;inf&amp;#39; ) ;
sparql SELECT COUNT(*) FROM &amp;lt;inf&amp;gt; WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;Then we declare that the triples in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph can be used for inference at run time. To enable this, a SPARQL query will declare that it uses the &lt;code&gt;&amp;#39;inft&amp;#39;&lt;/code&gt; rule set. Otherwise this has no effect.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;rdfs_rule_set (&amp;#39;inft&amp;#39;, &amp;#39;inf&amp;#39;);
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;This is just a log checkpoint to finalize the work and truncate the transaction log. The server would also eventually do this in its own time.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;checkpoint;
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;Now we are ready for querying.&lt;/p&gt; &lt;h2&gt;Querying the Data&lt;/h2&gt; &lt;p&gt;The queries are given in 3 different versions: The first file, &lt;code&gt;lubm.sql&lt;/code&gt;, has the queries with most inference open coded as &lt;code&gt;UNIONs&lt;/code&gt;. The second file, &lt;code&gt;lubm-inf.sql&lt;/code&gt;, has the inference performed at run time using the ontology &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id1109faf0&quot;&gt;information&lt;/a&gt; in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph we just loaded. The last, &lt;code&gt;lubm-phys.sql&lt;/code&gt;, relies on having the entailed triples physically present in the &lt;code&gt;&amp;lt;lubm&amp;gt;&lt;/code&gt; graph. 
These entailed triples are inserted by the SPARUL commands in the &lt;code&gt;lubm-cp.sql&lt;/code&gt; file.&lt;/p&gt; &lt;p&gt;If you wish to run all the commands in a SQL file, you can type &lt;code&gt;load &amp;lt;filename&amp;gt;;&lt;/code&gt; (e.g., &lt;code&gt;load lubm-cp.sql;&lt;/code&gt;) at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt. If you wish to try individual statements, you can paste them to the command line.&lt;/p&gt; &lt;p&gt;For example: &lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;SQL&amp;gt; sparql PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
SELECT * FROM &amp;lt;lubm&amp;gt; WHERE { ?x a ub:Publication . ?x ub:publicationAuthor &amp;lt;http://www.Department0.University0.edu/AssistantProfessor0&amp;gt; };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;To stop the server, simply type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.&lt;/p&gt; &lt;p&gt;If you wish to use a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id11384668&quot;&gt;SPARQL protocol&lt;/a&gt; end point, just enable the HTTP listener. This is done by adding a stanza like&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;[HTTPServer]
ServerPort = 8421
ServerRoot = .
ServerThreads = 2
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;to the end of the &lt;code&gt;virtuoso.ini&lt;/code&gt; file in the &lt;code&gt;lubm&lt;/code&gt; directory. 
Then shut down and restart (type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt and then &lt;code&gt;virtuoso-t -f &amp;amp;&lt;/code&gt; at the shell prompt).&lt;/p&gt; &lt;p&gt;Now you can connect to the end point with a web browser. The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id113d02d8&quot;&gt;URL&lt;/a&gt; is &lt;code&gt;http://localhost:8421/sparql&lt;/code&gt;. Without parameters, this will show a human readable form. With parameters, this will execute SPARQL.&lt;/p&gt; &lt;p&gt;We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id142d0ba0&quot;&gt;PHP&lt;/a&gt;, using the PHP ODBC interface.&lt;/p&gt; &lt;p&gt;To see how to use &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id117074f0&quot;&gt;Jena&lt;/a&gt; or &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id1103c9b0&quot;&gt;Sesame&lt;/a&gt; with Virtuoso, look at &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html&quot; id=&quot;link-id15488ce8&quot;&gt;Native RDF Storage Providers&lt;/a&gt;. To see how RDF data types are supported, see &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf&quot; id=&quot;link-id15784a40&quot;&gt;Extension datatype for RDF&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do &lt;code&gt;log_enable (2);&lt;/code&gt; before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space. See &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/&quot; id=&quot;link-id111410f0&quot;&gt;documentation&lt;/a&gt; for more.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso Vs. MySQL: Setting the Berlin Record Straight (update 2)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-11-20#1484</atom:id>
  <atom:published>2008-11-20T11:06:11Z</atom:published>
  <atom:updated>2008-11-24T10:15:05-05:00</atom:updated>
  <atom:content type="html">&lt;p&gt;In the context of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0xa5314d8&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt;, I have repeatedly written about measurement procedures and steady state. The point is that the numbers at larger scales are unreliable due to cache behavior if one is not careful about measurement and does not have adequate warmup. Thus it came to pass that one cut of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x18482c20&quot;&gt;BSBM&lt;/a&gt; paper had 3 seconds for &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0xb8c54de8&quot;&gt;MySQL&lt;/a&gt; and 100 for &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x189b2210&quot;&gt;Virtuoso&lt;/a&gt;, basically through ignoring cache effects.&lt;/p&gt; &lt;p&gt;So we decided to do it ourselves.&lt;/p&gt; &lt;p&gt;The score is (updated with revised &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; setting, based on advice noted down below):&lt;/p&gt; &lt;table border=&quot;1&quot; cellspacing=&quot;2&quot; cellpadding=&quot;5&quot;&gt; &lt;tr&gt; &lt;th&gt;n-clients&lt;/th&gt; &lt;th&gt;Virtuoso&lt;/th&gt; &lt;th&gt;MySQL &lt;br /&gt; (with increased buffer pool size)&lt;/th&gt; &lt;th&gt;MySQL &lt;br /&gt; (with default buffer pool size)&lt;/th&gt; &lt;/tr&gt; &lt;tr align=&quot;right&quot;&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt; 41,161.33&lt;/td&gt; &lt;td&gt; 27,023.11 &lt;/td&gt; &lt;td&gt; 12,171.41&lt;/td&gt; &lt;/tr&gt; &lt;tr align=&quot;right&quot;&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt; 127,918.30&lt;/td&gt; &lt;td&gt; (pending) &lt;/td&gt; &lt;td&gt; 37,566.82&lt;/td&gt; &lt;/tr&gt; &lt;tr align=&quot;right&quot;&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt; 218,162.29 &lt;/td&gt; &lt;td&gt; 105,524.23 &lt;/td&gt; &lt;td&gt; 51,104.39 &lt;/td&gt; &lt;/tr&gt; &lt;tr 
align=&quot;right&quot;&gt; &lt;td&gt;16&lt;/td&gt; &lt;td&gt; 214,763.58 &lt;/td&gt; &lt;td&gt; 98,852.42 &lt;/td&gt; &lt;td&gt; 47,589.18 &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;p&gt;The metric is query mixes per hour, taken from the BSBM test driver output. For the interested, the complete output is &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/bsbmres.txt&quot; id=&quot;link-id1119f770&quot;&gt;here&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The benchmark is pure &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x5257718&quot;&gt;SQL&lt;/a&gt;, nothing to do with &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xb8c463e0&quot;&gt;SPARQL&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x16e68d50&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The hardware is 2 x Xeon 5345 (2 x quad core, 2.33 GHz), 16 GB RAM. The OS is 64-bit Debian Linux.&lt;/p&gt; &lt;p&gt;The benchmark was run at a scale of 200,000. Each run had 2000 warm-up query mixes and 500 measured query mixes, which gives steady state, eliminating any effects of OS disk cache and the like. Both databases were configured to use 8 GB for disk cache. The test effectively runs from memory. We ran &lt;code&gt;ANALYZE TABLE&lt;/code&gt; on each MySQL table but noticed that this had no effect. Virtuoso samples statistics on the fly; possibly MySQL does too, since the explicit statistics made no difference. The MySQL tables were served by the InnoDB engine. MySQL appears to cache results of queries in some cases. This was not apparent in the tests.&lt;/p&gt; &lt;p&gt;The versions are 5.09 for Virtuoso and 5.1.29 for MySQL. 
You can download and examine the following:&lt;/p&gt; &lt;ul&gt; &lt;li&gt; &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/virtuoso.ini&quot; id=&quot;link-id14fe17f0&quot;&gt;Virtuoso configuration file&lt;/a&gt; &lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/my.cnf&quot; id=&quot;link-id116fe490&quot;&gt;MySQL configuration file&lt;/a&gt; &lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/create_tables_and_rdf_view.sql&quot; id=&quot;link-id14ce9268&quot;&gt;Table definitions &amp;amp; RDF views&lt;/a&gt; &lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/texts/mysqlinx.sql&quot; id=&quot;link-id1535e298&quot;&gt;Indexes on MySQL tables&lt;/a&gt; &lt;/li&gt; &lt;/ul&gt; &lt;p&gt; &lt;strike&gt;MySQL ought to do better. We suspect that here, just as in the TPC-D experiment we made way back, the query plans are not quite right. Also we rarely saw over 300% CPU utilization for MySQL. It is possible there is a config parameter that affects this. The public is invited to tell us about such.&lt;/strike&gt; &lt;/p&gt; &lt;p&gt; &lt;b&gt;Update:&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Andreas Schultz of the BSBM team advised us to increase the &lt;code&gt;innodb_buffer_pool_size&lt;/code&gt; setting in the MySQL config. We did and it produced some improvement. Indeed, this is more like it, as we now see CPU utilization around 700% instead of the 300% in the previously published run, which rendered it suspect. Also, our experiments with TPC-D led us to expect better. We ran the tests a few times so as to have a warm cache.&lt;/p&gt; &lt;p&gt;On the first run, we noticed that the InnoDB warm-up time was somewhere well in excess of 2000 query mixes. Another time, we should make a graph of throughput as a function of time for both MySQL and Virtuoso. We recently made a greedy prefetch hack that should give us some mileage there. 
For the next BSBM, all we can advise is to run the larger-scale systems for half an hour first, then measure, and then measure again. If the second measurement matches the first, the result is good.&lt;/p&gt; &lt;p&gt;As always, since MySQL is not our specialty, we confidently invite the public to tell us how to make it run faster. So, unless something more turns up, our next trial is a revisit of &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x122eaa00&quot;&gt;TPC-H&lt;/a&gt;.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>ISWC 2008: Some Questions</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-11-04#1479</atom:id>
  <atom:published>2008-11-04T15:54:42Z</atom:published>
  <atom:updated>2008-11-04T14:36:50.000010-05:00</atom:updated>
  <atom:content type="html">&lt;h2&gt;Inference: Is it always forward chaining?&lt;/h2&gt; &lt;p&gt;We got a number of questions about &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x131604a8&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s inference support. It seems that we are the odd one out, as we do not take it for granted that inference ought to consist of materializing entailment.&lt;/p&gt; &lt;p&gt;Firstly, of course one can materialize all one wants with Virtuoso. The simplest way to do this is using SPARUL. With the recent transitivity extensions to &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1422f910&quot;&gt;SPARQL&lt;/a&gt;, it is also easy to materialize implications of transitivity with a single statement. Our point is that for trivial entailment such as subclass, sub-property, single transitive property, and &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x145894a8&quot;&gt;owl&lt;/a&gt;:sameAs, we do not require materialization, as we can resolve these at run time also, with backward-chaining built into the engine.&lt;/p&gt; &lt;p&gt;For more complex situations, one needs to materialize the entailment. At the present time, we know how to generalize our transitive feature to run arbitrary backward-chaining rules, including recursive ones. We could have a sort of Datalog backward-chaining embedded in our &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1458a288&quot;&gt;SQL&lt;/a&gt;/SPARQL and could run this with good parallelism, as the transitive feature already works with clustering and partitioning without dying of message latency. Exactly when and how we do this will be seen. 
Even if users want entailment to be materialized, such a rule system could be used for producing the materialization at good speed.&lt;/p&gt; &lt;p&gt;We had a word with &lt;a href=&quot;http://web.comlab.ox.ac.uk/people/Ian.Horrocks/&quot; id=&quot;link-id117c99d0&quot;&gt;Ian Horrocks&lt;/a&gt; on the question. He noted that it is often naive on the part of the community to equate description of semantics with description of algorithm. The &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x14cf0b18&quot;&gt;data&lt;/a&gt; need not always be blown up.&lt;/p&gt; &lt;p&gt;The advantage of not always materializing is that the working set stays smaller. Once the working set is no longer in memory, response times jump disproportionately. Also, if the data changes or is retracted or is unreliable, one can end up doing a lot of extra work with materialization. Consider the effect of one malicious sameAs statement. This can lead to a lot of effects that are hard to retract. On the other hand, if running in memory with static data such as the LUBM benchmark, the queries run some 20% faster if subclass and sub-property entailment is materialized rather than resolved at run time.&lt;/p&gt; &lt;h2&gt;Genetic Algorithms for SPARQL?&lt;/h2&gt; &lt;p&gt;Our compliments for the wildest idea of the conference go to &lt;a href=&quot;http://www.eyaloren.org/&quot; id=&quot;link-id1a203af8&quot;&gt;Eyal Oren&lt;/a&gt;, &lt;a href=&quot;http://www.few.vu.nl/~cgueret/&quot; id=&quot;link-id16208758&quot;&gt;Christophe Guéret&lt;/a&gt;, and &lt;a href=&quot;http://www.few.vu.nl/~schlobac/&quot; id=&quot;link-id111923e0&quot;&gt;Stefan Schlobach&lt;/a&gt;, &lt;i&gt;et al.&lt;/i&gt;, for their &lt;a href=&quot;http://www.informatik.uni-trier.de/~ley/db/conf/semweb/iswc2008.html#OrenGS08&quot; id=&quot;link-id11793540&quot;&gt;paper on using genetic algorithms for guessing how variables in a SPARQL query ought to be instantiated&lt;/a&gt;. 
Prisoners of our &amp;quot;conventional wisdom&amp;quot; as we are, this might never have occurred to us.&lt;/p&gt; &lt;h2&gt;Schema Last?&lt;/h2&gt; &lt;p&gt;It is interesting to see how the industry comes to the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1154c1b0&quot;&gt;semantic web&lt;/a&gt; conferences talking about schema last while at the same time the traditional semantic web people stress enforcing schema constraints and making more predictably performing and database friendlier logics. So do the extremes converge.&lt;/p&gt; &lt;p&gt;There is a point to schema last. &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x14c6a930&quot;&gt;RDF&lt;/a&gt; is very good for getting a view of ad hoc or unknown data. One can just load and look at what there is. Also, additions of unforeseen optional properties or relations to the schema are easy and efficient. However, it seems that a really high traffic online application would always benefit from having some application specific data structures. Such could also save considerably in hardware.&lt;/p&gt; &lt;p&gt;It is not a sharp divide between RDF and relational application oriented representation. We have the capabilities in our RDB to RDF mapping. We just need to show this and have SPARUL and data loading&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>ISWC 2008: The Scalable Knowledge Systems Workshop</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-11-03#1471</atom:id>
  <atom:published>2008-11-03T13:16:47Z</atom:published>
  <atom:updated>2008-11-03T12:33:49-05:00</atom:updated>
  <atom:content type="html">&lt;p&gt;Mike Dean of &lt;a href=&quot;http://dbpedia.org/resource/BBN_Technologies&quot; id=&quot;link-id0x21d04768&quot;&gt;BBN Technologies&lt;/a&gt; opened the Scalable &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x22348c58&quot;&gt;Knowledge&lt;/a&gt; Systems Workshop with an invited talk. He reminded us of the facts of nature as they concern the cost of distributed computing and running out of space for the working set. Developers in the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x22570328&quot;&gt;semantic web&lt;/a&gt; field deplorably often ignore these facts, or alternatively recognize them and admit that they are unbeatable, that one just can&amp;#39;t join across partitions.&lt;/p&gt; &lt;p&gt;I gave a talk about the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x23f313f0&quot;&gt;Virtuoso&lt;/a&gt; Cluster edition, wherein I covered essentially the same ground facts as Mike and outlined how we (in spite of these) profit from distributed memory multiprocessing. To those not intimate with these questions, let me affirm that deriving benefit from threading in a symmetric multiprocessor box, let alone a cluster connected by a network, totally depends on having many relatively long-running things going at a time and blocking as seldom as possible.&lt;/p&gt; &lt;p&gt;Further, Mike Dean talked about &lt;a href=&quot;http://www.asio.bbn.com/&quot; id=&quot;link-id0x1d74c108&quot;&gt;ASIO&lt;/a&gt;, the BBN suite of semantic web tools. His most challenging statement was about the storage engine, a network-database-inspired triple-store using memory-mapped files. &lt;/p&gt; &lt;p&gt;Will the &lt;a href=&quot;http://dbpedia.org/resource/CODASYL&quot; id=&quot;link-id0x1f8ee860&quot;&gt;CODASYL&lt;/a&gt; days come back, and will the linked list on disk be the way to store triples/quads? 
I would say that this will probably have a better best case than a B-tree, especially with a memory-mapped file, but that it will also be less predictable under fragmentation. With Virtuoso, using a B-tree index, we see about 20-30% of CPU time spent on index lookup when running LUBM queries. With a disk-based memory-mapped linked-list storage, we would see some improvement in this while probably getting hit worse than now in the case of fragmentation. Plus compaction on the fly would not be nearly as easy and surely far less local, if there were pointers between pages. So it is my intuition that trees are a safer bet with varying workloads while linked lists can be faster in a query-dominated in-memory situation.&lt;/p&gt; &lt;p&gt;Chris Bizer presented the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1d670da0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x21928808&quot;&gt;BSBM&lt;/a&gt;), which has already been discussed here in some detail. He did acknowledge that the next round of the race must have a real steady-state rule. This just means that the benchmark must be run long enough for the system under test to reach a state where the cache is full and the performance remains indefinitely at the same level. 
Reaching steady state can take 20-30 minutes in some cases.&lt;/p&gt; &lt;p&gt;Regardless of steady state, BSBM has two generally valid conclusions: &lt;/p&gt; &lt;ol&gt; &lt;li&gt;mapping relational to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xab811020&quot;&gt;RDF&lt;/a&gt;, where possible, is faster than triple storage; and &lt;/li&gt; &lt;li&gt;the equivalent relational solution can be some 10x faster than the pure triples representation.&lt;/li&gt; &lt;/ol&gt; &lt;p&gt;Mike Dean asked whether BSBM was set up to make triple stores fail. Not necessarily, I would say; we should understand that one motivation of BSBM is testing mapping technologies. Therefore it must have a workload where mapping makes sense. Of course there are workloads where triples are unchallenged; take the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x2538c3b8&quot;&gt;Billion Triples Challenge&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1d673760&quot;&gt;data&lt;/a&gt; set, for one.&lt;/p&gt; &lt;p&gt;Also, with BSBM, one should note that the query optimization time plays a fairly large role since most queries touch relatively little data. Also, even if the scale is large, the working set is not nearly the size of the database. This in fact penalizes mapping technologies relative to native &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xac16cc10&quot;&gt;SQL&lt;/a&gt;, since the difference between the two lies in compiling the query, especially as parameters are not used. 
So, Chris, since we both like to map, let&amp;#39;s make a benchmark that shows mapping closer to native SQL.&lt;/p&gt; &lt;h2&gt;Bridging the 10x Gap?&lt;/h2&gt; &lt;p&gt;When we run Virtuoso relational against Virtuoso triple store with the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1d7dc518&quot;&gt;TPC-H&lt;/a&gt; workload, we see that the relational case is significantly faster. These are long queries, thus query optimization time is negligible; we are here comparing memory-based access times. Why is this? The answer is that a single index lookup gives multiple column values with almost no penalty for the extra column. Also, since the total number of joins is lower, the overhead of moving from one join to the next is likewise lower. This is just a matter of the count of executed instructions.&lt;/p&gt; &lt;p&gt;A column store joins in principle just as much as a triple store. However, since the BI workload often consists of scanning over large tables, the joins tend to be local: the needed lookup can often use the previous location as a starting point. A triple store can do the same if queries have high locality. We do this in some SQL situations and can try this with triples also. The RDF workload is typically more random in its access pattern, though. The other factor is the length of the control path. A column store has a simpler control flow if it knows that the column will have exactly one value per row. With RDF, this is not a given. Also, the column store&amp;#39;s row is identified by a single number and not a multipart key. These two factors give the column store running with a fixed schema some edge over the more generic RDF quad store.&lt;/p&gt; &lt;p&gt;There was some discussion on how much closer a triple store could come to a relational one. Some gains are undoubtedly possible. We will see. 
For the ideal row store workload, the &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x22e5b6f8&quot;&gt;RDBMS&lt;/a&gt; will continue to have some edge. Large online systems typically have a large part of the workload that is simple and repetitive. There is nothing to prevent one from having special indices for supporting such a workload, even while retaining the possibility of arbitrary triples elsewhere. Some degree of application-specific data structure does make sense. We just need to show how this is done. In this way, we have a continuum and not an either/or choice of triples vs. tables.&lt;/p&gt; &lt;h2&gt;Scale, Where Next?&lt;/h2&gt; &lt;p&gt;Concerning the future direction of the workshop, there were a few directions suggested. One of the more interesting ones was Mike Dean&amp;#39;s suggestion about dealing with a large volume of same-as assertions, specifically a volume where materializing all the entailed triples was no longer practical. Of course, there is the question of scale. This time, we were the only ones focusing on a parallel database with no restrictions on joining.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-10-26#1465</atom:id>
  <atom:published>2008-10-26T12:15:35Z</atom:published>
  <atom:updated>2008-10-27T12:07:52-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;&amp;quot;Physician, heal thyself,&amp;quot; it is said. We profess to say what the messaging of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1fa3da18&quot;&gt;semantic web&lt;/a&gt; ought to be, but is our own perfect?&lt;/p&gt; &lt;p&gt;I will here engage in some critical introspection as well as amplify on some answers given to &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1e1eecf0&quot;&gt;Virtuoso&lt;/a&gt;-related questions in recent times.&lt;/p&gt; &lt;p&gt;I use some conversations from the &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x1ec0b2e0&quot;&gt;Vienna&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x2045ac10&quot;&gt;Linked Data&lt;/a&gt; Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x2045ac38&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x14f63c58&quot;&gt;OpenLink Data Spaces&lt;/a&gt;) applications line, &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x14f63c80&quot;&gt;OAT&lt;/a&gt; (&lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1e536928&quot;&gt;OpenLink Ajax Toolkit&lt;/a&gt;), or &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1eaed7f8&quot;&gt;ODE&lt;/a&gt; (&lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1edfff88&quot;&gt;OpenLink Data Explorer&lt;/a&gt;).&lt;/p&gt; &lt;h3&gt;&amp;quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&amp;quot; said &lt;a href=&quot;http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me&quot; id=&quot;link-id0x1b8a9580&quot;&gt;Sören Auer&lt;/a&gt;.&lt;/h3&gt; 
&lt;p&gt;Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.&lt;/p&gt; &lt;p&gt;This is why we put a lot of emphasis on Linked Data and the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1b85fa38&quot;&gt;Data&lt;/a&gt; Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1f5b4468&quot;&gt;RDF&lt;/a&gt; store.&lt;/p&gt; &lt;p&gt;We can do this because we own our database and &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x20076468&quot;&gt;SQL&lt;/a&gt; and data access middleware and have a history of connecting to any &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1ffd6f98&quot;&gt;RDBMS&lt;/a&gt; out there.&lt;/p&gt; &lt;p&gt;The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.&lt;/p&gt; &lt;p&gt;There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1316f720&quot;&gt;SPARQL&lt;/a&gt; options, thus accessed via standard interfaces. 
We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x20756a68&quot;&gt;Sesame&lt;/a&gt;- and &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x1ec01ac0&quot;&gt;Jena&lt;/a&gt;-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).&lt;/p&gt; &lt;p&gt;Now, this message could be better reflected in our material on the web. This &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x2027b410&quot;&gt;blog&lt;/a&gt; is a rather informal step in this direction; more is to come. For now we concentrate on delivering.&lt;/p&gt; &lt;p&gt;The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.&lt;/p&gt; &lt;p&gt;This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x2051ff98&quot;&gt;TPC-H&lt;/a&gt; database into RDF, queries and all. Anything less is not interesting. 
Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?&lt;/p&gt; &lt;p&gt;The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?&lt;/p&gt; &lt;h3&gt;&amp;quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&amp;quot;&lt;/h3&gt; &lt;p&gt;We should answer in multiple parts.&lt;/p&gt; &lt;p&gt;For general collateral, like web sites and documentation:&lt;/p&gt; &lt;p&gt;The web site gives a confused product image. For the Virtuoso product, we should divide at the top into&lt;/p&gt; &lt;ul&gt; &lt;li&gt; Data web and RDF - Host linked data, expose relational assets as linked data;&lt;/li&gt; &lt;li&gt; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;&lt;/li&gt; &lt;li&gt; Web Services - access all the above over standard protocols, dynamic web pages, web hosting.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;For each point, one simple statement. We all know what the above things mean?&lt;/p&gt; &lt;p&gt;Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. 
This means that much more data or, in some cases, that many more requests per second. This too is clear.&lt;/p&gt; &lt;p&gt;As far as I am concerned, hosting Java or .&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x1f297540&quot;&gt;NET&lt;/a&gt; does not have to be on the front page. Also, we have no great interest in going against &lt;a href=&quot;http://dbpedia.org/resource/Apache&quot; id=&quot;link-id0x1ea29578&quot;&gt;Apache&lt;/a&gt; when it comes to a web-server-only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.&lt;/p&gt; &lt;p&gt;Then for documentation and training materials: The documentation should be better. Specifically, it should have more of a how-to dimension since nobody reads the whole thing anyhow. As for the online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.&lt;/p&gt; &lt;p&gt;Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because they did not have enough references and did not explain what is obvious to us.&lt;/p&gt; &lt;p&gt;I think that the communications failure in this case is that we want to talk about end-to-end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not write a paper about the query cost model alone: the cost model, while indispensable, is taken for granted where we come from. So we mention RDF adaptations to the cost model, as these are important to the whole, but do not find them to justify a whole paper. If we wrote papers on this basis, we would have to write five times as many. 
Maybe we ought to.&lt;/p&gt; &lt;h3&gt;&amp;quot;Virtuoso is very big and very difficult&amp;quot;&lt;/h3&gt; &lt;p&gt;One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.&lt;/p&gt; &lt;p&gt;This gives you SQL and SPARQL out of the box. Adding &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x20a2e7d0&quot;&gt;ODBC&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1e4cceb8&quot;&gt;JDBC&lt;/a&gt; clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.&lt;/p&gt; &lt;p&gt;Now for the difficult side:&lt;/p&gt; &lt;p&gt;Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to, be it &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x20b03f08&quot;&gt;PHP&lt;/a&gt; or Java or anything else. For web services, simple is best. Stick to basics. &amp;quot;The engineer is one who can invent a simple thing.&amp;quot; Use SQL statements rather than admin UI.&lt;/p&gt; &lt;p&gt;Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered.&lt;/p&gt; &lt;p&gt;We should put this into a couple of use case oriented how-tos.&lt;/p&gt; &lt;p&gt;Also, we should create a network of &amp;quot;friendly local virtuoso geeks&amp;quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. 
Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.&lt;/p&gt; &lt;h3&gt;&amp;quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&amp;quot;&lt;/h3&gt; &lt;p&gt;Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.&lt;/p&gt; &lt;p&gt;If one really wants to do one&amp;#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.&lt;/p&gt; &lt;p&gt;We are talking about such things with different parties at present.&lt;/p&gt; &lt;h3&gt;&amp;quot;How webby are you? What is webby?&amp;quot;&lt;/h3&gt; &lt;p&gt;&amp;quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&amp;quot;&lt;/p&gt; &lt;p&gt;We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. 
We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant &lt;i&gt;when&lt;/i&gt; it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.&lt;/p&gt; &lt;p&gt;Google-style crawling of everything becomes less practical if one needs to run complex &lt;i&gt;ad hoc&lt;/i&gt; queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.&lt;/p&gt; &lt;h3&gt;&amp;quot;How does OpenLink see the next five years unfolding?&amp;quot;&lt;/h3&gt; &lt;p&gt;Personally, I think we have the basics for the birth of a new inflection in the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x2018bd98&quot;&gt;knowledge&lt;/a&gt; economy. The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1ec110d8&quot;&gt;URI&lt;/a&gt; is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1ecfba08&quot;&gt;information&lt;/a&gt; can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. 
Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.&lt;/p&gt; &lt;p&gt;The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is &lt;i&gt;context&lt;/i&gt;. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.&lt;/p&gt; &lt;p&gt;Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.&lt;/p&gt; &lt;p&gt;Ever since the .com era, the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id0x1c4c9248&quot;&gt;URL&lt;/a&gt; has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.&lt;/p&gt; &lt;p&gt;With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertizing, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.&lt;/p&gt; &lt;p&gt;For OpenLink itself, this is our roadmap. 
The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x20791bf0&quot;&gt;Linked Open Data&lt;/a&gt; cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.&lt;/p&gt; &lt;p&gt;This makes some things possible that were hard thus far.&lt;/p&gt; &lt;p&gt;On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.&lt;/p&gt; &lt;p&gt;Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x1f27e6d8&quot;&gt;Sponger&lt;/a&gt;, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.&lt;/p&gt; &lt;p&gt;Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.&lt;/p&gt; &lt;p&gt;Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. 
We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso Update, Billion Triples and Outlook</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-10-02#1448</atom:id>
  <atom:published>2008-10-02T09:31:17Z</atom:published>
  <atom:updated>2008-10-02T12:47:02.000002-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;I will say a few things about what we have been doing and where we can go.&lt;/p&gt; &lt;p&gt;Firstly, we have a fairly scalable platform with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xa412e450&quot;&gt;Virtuoso&lt;/a&gt; 6 Cluster. It was most recently tested with the workload discussed in the previous &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445&quot; id=&quot;link-id1638a5b8&quot;&gt;Billion Triples post&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;There is an updated version of &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf&quot; id=&quot;link-id16280a68&quot;&gt;the paper about this&lt;/a&gt;. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.&lt;/p&gt; &lt;p&gt;Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1c1c5f48&quot;&gt;SQL&lt;/a&gt; optimizations specific to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1bcb6108&quot;&gt;RDF&lt;/a&gt;. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, is all there and proven to work.&lt;/p&gt; &lt;p&gt;We spent a lot of time around the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x3a4e17c8&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; story, so we got to the more advanced stuff like the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x1a66c568&quot;&gt;Billion Triples Challenge&lt;/a&gt; rather late. 
We did along the way also run &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x188c2608&quot;&gt;BSBM&lt;/a&gt; with an &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x1aa97f98&quot;&gt;Oracle&lt;/a&gt; back-end, with Virtuoso mapping &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1abd87a0&quot;&gt;SPARQL&lt;/a&gt; to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.&lt;/p&gt; &lt;p&gt;RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id0x1a941678&quot;&gt;RDB2RDF XG&lt;/a&gt;. Examples of complex warehouses include the &lt;a href=&quot;http://neurocommons.org/page/Main_Page&quot; id=&quot;link-id0x1aa5a9f8&quot;&gt;Neurocommons&lt;/a&gt; database, the Billion Triples Challenge, and the &lt;a href=&quot;http://www.garlik.com/&quot; id=&quot;link-id0x372df7b0&quot;&gt;Garlik DataPatrol&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1a671cf0&quot;&gt;Linked Data&lt;/a&gt; forum. BSBM&amp;#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1ab83dd0&quot;&gt;data&lt;/a&gt; web becomes as indispensable as presence on the HTML web.&lt;/p&gt; &lt;p&gt;I will now talk about the complex warehouse/web-harvesting side. 
I will come to the mapping in another post.&lt;/p&gt; &lt;p&gt;Now, all the things shown in the &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445&quot; id=&quot;link-id14de1d18&quot;&gt;Billion Triples post&lt;/a&gt; can be done with a relational system specially built for each purpose. Since we are a general purpose &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x340d3470&quot;&gt;RDBMS&lt;/a&gt;, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.&lt;/p&gt; &lt;p&gt;Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.&lt;/p&gt; &lt;p&gt;The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &amp;quot;same as&amp;quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. 
If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.&lt;/p&gt; &lt;p&gt;We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of &lt;a href=&quot;http://umbel.org/about/&quot; id=&quot;link-id0x19cabf38&quot;&gt;UMBEL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Cyc&quot; id=&quot;link-id0x19cecd10&quot;&gt;OpenCyc&lt;/a&gt;. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.&lt;/p&gt; &lt;p&gt;We expect to be able to combine geography, social proximity, subject matter, and &lt;a href=&quot;http://dbpedia.org/resource/Named_entity_recognition&quot; id=&quot;link-id0x1a8202e8&quot;&gt;named entities&lt;/a&gt;, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.&lt;/p&gt; &lt;p&gt;We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. 
Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.&lt;/p&gt; &lt;p&gt;Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.&lt;/p&gt; &lt;p&gt;The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1ab0a918&quot;&gt;information&lt;/a&gt; and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.&lt;/p&gt; &lt;p&gt;Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like &lt;i&gt;list the top 10 attributes with the most distinct values for all persons&lt;/i&gt; cannot be done in SQL. SQL simply does not allow the columns to be variable.&lt;/p&gt; &lt;p&gt;Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.&lt;/p&gt; &lt;p&gt;The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.&lt;/p&gt; &lt;p&gt;Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>A quick look at SP2B, the SPARQL Performance Benchmark</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-08-27#1422</atom:id>
  <atom:published>2008-08-27T16:00:07Z</atom:published>
  <atom:updated>2008-09-02T09:49:55-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;I finally got around to running the &lt;a href=&quot;http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B&quot; id=&quot;link-id17bac628&quot;&gt;SP&lt;sup&gt;2&lt;/sup&gt;B SPARQL Performance Benchmark&lt;/a&gt; on the current &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1d2a6838&quot;&gt;Virtuoso&lt;/a&gt; Open Source Edition, v5.0.8.&lt;/p&gt; &lt;p&gt;I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.&lt;/p&gt; &lt;p&gt;I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.&lt;/p&gt; &lt;p&gt;I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.&lt;/p&gt; &lt;p&gt;The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit &lt;code&gt;FROM&lt;/code&gt; clause added; the client was the command line Interactive &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x19e74ce0&quot;&gt;SQL&lt;/a&gt; (iSQL).&lt;/p&gt; &lt;p&gt;If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x1c4231a0&quot;&gt;SPARQL protocol&lt;/a&gt; is not practical.&lt;/p&gt; &lt;p&gt;I will say something more about SP&lt;sup&gt;2&lt;/sup&gt;B when I get to have a closer look.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Configuring Virtuoso for Benchmarking</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-08-25#1418</atom:id>
  <atom:published>2008-08-25T14:05:46Z</atom:published>
  <atom:updated>2008-08-25T15:29:04-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;I will here summarize what should be known about running benchmarks with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xc53af18&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt; &lt;h2&gt;Physical Memory&lt;/h2&gt; &lt;p&gt;For 8G RAM, in the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set â&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; NumberOfBuffers = 550000 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;For 16G RAM, double thisâ&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; NumberOfBuffers = 1100000 &lt;/code&gt; &lt;/blockquote&gt; &lt;h2&gt;Transaction Isolation&lt;/h2&gt; &lt;p&gt;For most cases, certainly all &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xc2f07a0&quot;&gt;RDF&lt;/a&gt; cases, &lt;i&gt;Read Committed&lt;/i&gt; should be the default transaction isolation. In the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set â&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; DefaultIsolation = 2 &lt;/code&gt; &lt;/blockquote&gt; &lt;h2&gt;Multiuser Workload&lt;/h2&gt; &lt;p&gt;If &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0xc1c7178&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0xd16fb40&quot;&gt;JDBC&lt;/a&gt;, or similarly connected client applications are used, there must be more &lt;code&gt;ServerThreads&lt;/code&gt; available than there will be client connections. 
In the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; ServerThreads = 100 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer &lt;code&gt;ServerThreads&lt;/code&gt; than there are concurrent clients. The &lt;code&gt;MaxKeepAlives&lt;/code&gt; should be the maximum number of expected web clients. This can be more than the &lt;code&gt;ServerThreads&lt;/code&gt; count. In the &lt;code&gt;[HTTPServer]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [HTTPServer]&lt;br /&gt; ...&lt;br /&gt; ServerThreads = 100 &lt;br /&gt; MaxKeepAlives = 1000 &lt;br /&gt; KeepAliveTimeout = 10 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt; &lt;i&gt;&lt;b&gt;Note&lt;/b&gt;: The &lt;code&gt;[HTTPServer] ServerThreads&lt;/code&gt; are taken from the total pool made available by the &lt;code&gt;[Parameters] ServerThreads&lt;/code&gt;. Thus, the &lt;code&gt;[Parameters] ServerThreads&lt;/code&gt; should always be at least as large as (and is best set greater than) the &lt;code&gt;[HTTPServer] ServerThreads&lt;/code&gt;, and if using the closed-source Commercial Version, should not exceed the licensed thread count.&lt;/i&gt; &lt;/p&gt; &lt;h2&gt;Disk Use&lt;/h2&gt; &lt;p&gt;The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed). 
&lt;/p&gt; &lt;p&gt;For the example described above, in the &lt;code&gt;[Database]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Database]&lt;br /&gt; ...&lt;br /&gt; Striping = 1&lt;br /&gt; MaxCheckpointRemap = 2000000 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;And in the &lt;code&gt;[Striping]&lt;/code&gt; stanza, on one line per &lt;code&gt;SegmentName&lt;/code&gt;, set:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Striping]&lt;br /&gt; ...&lt;br /&gt; Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6&lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;As can be seen here, each file gets a background IO thread (the &lt;code&gt;= q&lt;i&gt;xxx&lt;/i&gt;&lt;/code&gt; clause). It should be noted that all files on the same physical device should have the same &lt;code&gt;q&lt;i&gt;xxx&lt;/i&gt;&lt;/code&gt; value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.&lt;/p&gt; &lt;h2&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xc9fa298&quot;&gt;SQL&lt;/a&gt; Optimization&lt;/h2&gt; &lt;p&gt;If queries have lots of joins but access little &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xb4e0aa0&quot;&gt;data&lt;/a&gt;, as with the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0xb2de990&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt;, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. 
Thus, in the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; StopCompilerWhenXOverRunTime = 1 &lt;/code&gt; &lt;/blockquote&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>BSBM With Triples and Mapped Relational Data</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-08-06#1409</atom:id>
  <atom:published>2008-08-06T19:35:27Z</atom:published>
  <atom:updated>2008-08-06T16:29:40-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;The special contribution of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id10039db0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id106b2538&quot;&gt;BSBM&lt;/a&gt;) to the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id101a75f8&quot;&gt;RDF&lt;/a&gt; world is to raise the question of doing OLTP with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xb230eb0&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Of course, here we immediately hit the question of comparisons with relational databases. To this effect, &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0xa832da8&quot;&gt;BSBM&lt;/a&gt; also specifies a relational schema and can generate the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id1206c378&quot;&gt;data&lt;/a&gt; as either triples or &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1667f040&quot;&gt;SQL&lt;/a&gt; inserts.&lt;/p&gt; &lt;p&gt;The benchmark effectively simulates the case of exposing an existing &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id10a93518&quot;&gt;RDBMS&lt;/a&gt; as RDF. &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id13e46d80&quot;&gt;OpenLink Software&lt;/a&gt; calls this &lt;i&gt;RDF Views&lt;/i&gt;. &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id12027578&quot;&gt;Oracle&lt;/a&gt; is beginning to call this &lt;i&gt;semantic covers&lt;/i&gt;. 
The &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id161dc678&quot;&gt;RDB2RDF XG&lt;/a&gt;, a W3C incubator group, has been active in this area since Spring, 2008.&lt;/p&gt; &lt;h3&gt;But why an OLTP workload with RDF to begin with?&lt;/h3&gt; &lt;p&gt;We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xabe48a0&quot;&gt;data&lt;/a&gt; is online for human consumption, it may be online via a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id106a8908&quot;&gt;SPARQL&lt;/a&gt; end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.&lt;/p&gt; &lt;p&gt;Warehousing all the world&amp;#39;s publishable data as RDF is not our first preference, nor would it be the publisher&amp;#39;s. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&amp;#39;ll do here.&lt;/p&gt; &lt;h3&gt;What We Got &lt;/h3&gt; &lt;p&gt;First, we found that &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400&quot; id=&quot;link-id150ea748&quot;&gt;making the query plan took much too long&lt;/a&gt; in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.&lt;/p&gt; &lt;p&gt;But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. 
And when there is a clear mapping, there is no reason the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x96bb5e0&quot;&gt;SPARQL&lt;/a&gt; could not be directly translated.&lt;/p&gt; &lt;p&gt;If Chris Bizer et al. launched the mapping ship, we will be the ones to pilot it to harbor!&lt;/p&gt; &lt;p&gt;We filled two &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id12dbdc70&quot;&gt;Virtuoso&lt;/a&gt; instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in &amp;quot;query mixes per hour&amp;quot;. (An update or follow-on to this post will provide elapsed times for each test run.)&lt;/p&gt; &lt;p&gt;With the unmodified benchmark we got:&lt;/p&gt; &lt;blockquote&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td&gt;1297 qmph&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td&gt;&lt;b&gt;3144 qmph&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use a text index for the mapping, and altered the databases accordingly. 
(There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)&lt;/p&gt; &lt;p&gt;The following were measured on the second run of a 100 query mix series, single test driver, warm cache.&lt;/p&gt; &lt;blockquote&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td&gt; 5746 qmph&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td&gt; &lt;b&gt;7525 qmph&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time.&lt;/p&gt; &lt;blockquote&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td&gt; 19459 qmph&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td&gt; &lt;b&gt;24531 qmph&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are about 3.3 to 3.4 times the single-driver throughput, a little under the ideal 4 times, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.&lt;/p&gt; &lt;p&gt;The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. 
We used the &lt;code&gt;StopCompilerWhenXOverRunTime = 1&lt;/code&gt; option here to cut needless compiler overhead, the queries being straightforward enough.&lt;/p&gt; &lt;p&gt;We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.&lt;/p&gt; &lt;h3&gt;Suggestions for BSBM&lt;/h3&gt; &lt;ul&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Reporting Rules.&lt;/b&gt; The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Multiuser operation.&lt;/b&gt; The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Add business intelligence.&lt;/b&gt; SPARQL has aggregates now, at least with &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id11a25ac0&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xa83f490&quot;&gt;Virtuoso&lt;/a&gt;, so let&amp;#39;s use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like &amp;quot;customers who bought this also bought xxx.&amp;quot;&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;For the SPARQL community&lt;/b&gt;, BSBM sends the message that one ought to support parameterized queries and stored procedures. 
This would be a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id109e2448&quot;&gt;SPARQL protocol&lt;/a&gt; extension; the SPARUL syntax should also have a way of calling a procedure. Something like &lt;code&gt;select proc (??, ??)&lt;/code&gt; would be enough, where &lt;code&gt;??&lt;/code&gt; is a parameter marker, like &lt;code&gt;?&lt;/code&gt; in &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id13febf48&quot;&gt;ODBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id120416a8&quot;&gt;JDBC&lt;/a&gt;.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Add transactions.&lt;/b&gt; Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server.&lt;/p&gt; &lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;Comments on Query Mix&lt;/h3&gt; &lt;p&gt;The time of most queries grows less than linearly with the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.&lt;/p&gt; &lt;h2&gt;Next&lt;/h2&gt; &lt;p&gt;We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso Optimizations for the Berlin SPARQL Benchmark</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-07-30#1400</atom:id>
  <atom:published>2008-07-30T18:17:54Z</atom:published>
  <atom:updated>2008-08-06T16:29:37.000003-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We had a look at Chris Bizer&amp;#39;s initial results with the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id105c9f78&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id102d62b0&quot;&gt;BSBM&lt;/a&gt;) on &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id13eb9780&quot;&gt;Virtuoso&lt;/a&gt;. The first results were rather bad, as nearly all of the run time was spent optimizing the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id14a51258&quot;&gt;SPARQL&lt;/a&gt; statements and under 10% actually running them.&lt;/p&gt; &lt;p&gt;So I spent a couple of days on the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xa5a8d0e8&quot;&gt;SPARQL&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id108745b0&quot;&gt;SQL&lt;/a&gt; compiler, making it produce a better initial guess at the execution plan and streamlining some operations. In fact, many of the queries in &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0xaf04af8&quot;&gt;BSBM&lt;/a&gt; are not particularly sensitive to execution plan, as they access a very small portion of the database. So to close the matter, I put in a flag that makes the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1e8d2360&quot;&gt;SQL&lt;/a&gt; compiler give up on devising new plans if the time of the best plan so far is less than the time spent compiling so far.&lt;/p&gt; &lt;p&gt;With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially. 
With the compiler time cut-off in place (ini parameter &lt;code&gt;StopCompilerWhenXOverRunTime = 1&lt;/code&gt;), we get the following times, output from the BSBM test driver:&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt; Starting test... 0: 1031.22 ms, total: 1151 ms 1: 982.89 ms, total: 1040 ms 2: 923.27 ms, total: 968 ms 3: 898.37 ms, total: 932 ms 4: 855.70 ms, total: 865 ms Scale factor: 10000 Number of query mix runs: 5 times min/max Query mix runtime: 0.8557 s / 1.0312 s Total runtime: 4.691 seconds QMpH: 3836.77 query mixes per hour CQET: 0.93829 seconds average runtime of query mix CQET (geom.): 0.93625 seconds geometric mean runtime of query mix Metrics for Query 1: Count: 5 times executed in whole run AQET: 0.012212 seconds (arithmetic mean) AQET(geom.): 0.009934 seconds (geometric mean) QPS: 81.89 Queries per second minQET/maxQET: 0.00684000s / 0.03115700s Average result count: 7.0 min/max result count: 3 / 10 Metrics for Query 2: Count: 35 times executed in whole run AQET: 0.030490 seconds (arithmetic mean) AQET(geom.): 0.029776 seconds (geometric mean) QPS: 32.80 Queries per second minQET/maxQET: 0.02467300s / 0.06753000s Average result count: 22.5 min/max result count: 15 / 30 Metrics for Query 3: Count: 5 times executed in whole run AQET: 0.006947 seconds (arithmetic mean) AQET(geom.): 0.006905 seconds (geometric mean) QPS: 143.95 Queries per second minQET/maxQET: 0.00580000s / 0.00795100s Average result count: 4.0 min/max result count: 0 / 10 Metrics for Query 4: Count: 5 times executed in whole run AQET: 0.008858 seconds (arithmetic mean) AQET(geom.): 0.008829 seconds (geometric mean) QPS: 112.89 Queries per second minQET/maxQET: 0.00804400s / 0.01019500s Average result count: 3.4 min/max result count: 0 / 10 Metrics for Query 5: Count: 5 times executed in whole run AQET: 0.087542 seconds (arithmetic mean) AQET(geom.): 0.087327 seconds (geometric mean) QPS: 11.42 Queries per second minQET/maxQET: 0.08165600s / 0.09889200s Average result count: 5.0 
min/max result count: 5 / 5 Metrics for Query 6: Count: 5 times executed in whole run AQET: 0.131222 seconds (arithmetic mean) AQET(geom.): 0.131216 seconds (geometric mean) QPS: 7.62 Queries per second minQET/maxQET: 0.12924200s / 0.13298200s Average result count: 3.6 min/max result count: 3 / 5 Metrics for Query 7: Count: 20 times executed in whole run AQET: 0.043601 seconds (arithmetic mean) AQET(geom.): 0.040890 seconds (geometric mean) QPS: 22.94 Queries per second minQET/maxQET: 0.01984400s / 0.06012600s Average result count: 26.4 min/max result count: 5 / 96 Metrics for Query 8: Count: 10 times executed in whole run AQET: 0.018168 seconds (arithmetic mean) AQET(geom.): 0.016205 seconds (geometric mean) QPS: 55.04 Queries per second minQET/maxQET: 0.01097600s / 0.05066900s Average result count: 12.8 min/max result count: 6 / 20 Metrics for Query 9: Count: 20 times executed in whole run AQET: 0.043813 seconds (arithmetic mean) AQET(geom.): 0.043807 seconds (geometric mean) QPS: 22.82 Queries per second minQET/maxQET: 0.04274900s / 0.04504100s Average result count: 0.0 min/max result count: 0 / 0 Metrics for Query 10: Count: 15 times executed in whole run AQET: 0.030697 seconds (arithmetic mean) AQET(geom.): 0.029651 seconds (geometric mean) QPS: 32.58 Queries per second minQET/maxQET: 0.02072000s / 0.03975700s Average result count: 1.1 min/max result count: 0 / 4 real 0 m 5.485 s user 0 m 2.233 s sys 0 m 0.170 s &lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;Of the approximately 5.5 seconds of running five query mixes, the test driver spends 2.2 s. The server side processing time is 3.1 s, of which SQL compilation is 1.35 s. The rest is miscellaneous system time. The measurement is on 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. 
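&lt;/p&gt; &lt;p&gt;As a rough check of these figures, the following Python sketch recomputes the throughput and the compilation share from the numbers printed by the test driver and quoted above. The variable names are illustrative only, and the small difference from the QMpH in the driver output presumably comes from the driver working with unrounded timings.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;
```python
# Figures copied from the test driver output and the paragraph above.
query_mixes = 5
total_runtime_s = 4.691   # "Total runtime" reported by the driver
server_time_s = 3.1       # server side processing time
compile_time_s = 1.35     # SQL compilation, part of the server time

# Query mixes per hour: mixes completed, scaled from seconds to one hour.
qmph = query_mixes * 3600 / total_runtime_s

# Fraction of server side time that goes into compiling the queries.
compile_share = compile_time_s / server_time_s

print(f"QMpH: {qmph:.0f}")
print(f"compilation share of server time: {compile_share:.0%}")
```
&lt;/pre&gt; &lt;/blockquote&gt; &lt;p&gt;This gives about 3837 query mixes per hour, close to the 3836.77 reported by the driver, and puts compilation at roughly 44% of the server side time.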
&lt;/p&gt; &lt;p&gt;We note that this type of workload would be done with stored procedures or prepared, parameterized queries in the SQL world.&lt;/p&gt; &lt;p&gt;There will be some further tuning still but this addresses the bulk of the matter. There will be a separate message about the patch containing these improvements.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>DBpedia Benchmark Revisited</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-05-09#1358</atom:id>
  <atom:published>2008-05-09T19:27:00Z</atom:published>
  <atom:updated>2008-05-12T11:24:36-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We ran the &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x1b7f9688&quot;&gt;DBpedia&lt;/a&gt; benchmark queries again with different configurations of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1cca2e00&quot;&gt;Virtuoso&lt;/a&gt;. I had not studied the details of the matter previously but now did have a closer look at the queries.&lt;/p&gt; &lt;p&gt;Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box.&lt;/p&gt; &lt;p&gt;We ran the queries in two variants:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;With graph specified in the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1b77f758&quot;&gt;SPARQL&lt;/a&gt; &lt;code&gt;FROM&lt;/code&gt; clause, using the default indices.&lt;/li&gt; &lt;li&gt;With no graph specified anywhere, using an alternate indexing scheme.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times.&lt;/p&gt; &lt;p&gt;First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. 
With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient.&lt;/p&gt; &lt;p&gt;So we added the index:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; create bitmap index &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1cb0b180&quot;&gt;rdf&lt;/a&gt;_quad_pogs on rdf_quad (p, o, g, s); &lt;/code&gt; &lt;/blockquote&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;&lt;b&gt;Virtuoso v5 with&lt;br /&gt; gspo, ogps, pogs&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;&lt;b&gt;Virtuoso Cluster v6 with &lt;br /&gt;gspo, ogps&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;&lt;b&gt;Virtuoso Cluster v6 with &lt;br /&gt;gspo, ogps, pogs&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;b&gt;cold&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;210 s&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;136 s&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;33.4 s&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;b&gt;warm&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0.600 s&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4.01 s&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0.628 s&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;p&gt;OK, so now let us do it without a graph being specified. 
For all platforms, we drop any existing indices, and run:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any, primary key (s, p, o, g)); &lt;br /&gt; alter index R2 on R2 partition (s int (0hexffff00)); &lt;br /&gt; &lt;br /&gt; log_enable (2); &lt;br /&gt; insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad; &lt;br /&gt; &lt;br /&gt; drop table rdf_quad; &lt;br /&gt; alter table r2 rename RDF_QUAD; &lt;br /&gt; create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff)); &lt;br /&gt; create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff)); &lt;br /&gt; create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff)); &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;The code is identical for v5 and v6, except that with v5 we use &lt;code&gt;iri_id (32 bit)&lt;/code&gt; for the type, not &lt;code&gt;iri_id_8 (64 bit)&lt;/code&gt;. We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.&lt;/p&gt; &lt;p&gt;With the above 4 indices, we can query the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x6339b80&quot;&gt;data&lt;/a&gt; pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. 
This takes about 60% of the space of a non-bitmap index for data such as DBpedia.&lt;/p&gt; &lt;p&gt;If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.&lt;/p&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&amp;#160;&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;&lt;b&gt; Virtuoso v5 with&lt;br /&gt; gspo, ogps, pogs&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;&lt;b&gt; Virtuoso Cluster v6 with &lt;br /&gt; spog, pogs, opgs, gpos &lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;b&gt;warm&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0.595 s&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0.617 s&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;p&gt;The cold times were about the same as above, so not reproduced.&lt;/p&gt; &lt;h3&gt;Graph or No Graph?&lt;/h3&gt; &lt;p&gt;It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.&lt;/p&gt; &lt;p&gt;On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant different setups of the database, simple as that.&lt;/p&gt; &lt;p&gt;The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example, &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x11ed7028&quot;&gt;Oracle&lt;/a&gt; or Vertica would not do this well since they partition data according to graph or predicate, respectively. 
On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above.&lt;/p&gt; &lt;p&gt;Frameworks like Jena are not to my &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1a49ded0&quot;&gt;knowledge&lt;/a&gt; geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some &amp;quot;super-graph&amp;quot; object, a graph of all graphs. I don&amp;#39;t think this is directly supported and besides most apps would not need it.&lt;/p&gt; &lt;p&gt;Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.&lt;/p&gt; &lt;h3&gt;Conclusions&lt;/h3&gt; &lt;p&gt;The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.&lt;/p&gt; &lt;p&gt;Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. 
Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content.&lt;/p&gt; &lt;p&gt;This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.&lt;/p&gt; &lt;p&gt;The moral of the story is to have a lot of RAM and space-efficient data representation.&lt;/p&gt; &lt;p&gt;The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache, neither of which is quite realistic.&lt;/p&gt; &lt;p&gt;Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.&lt;/p&gt; &lt;h3&gt;Appendix: Query Text&lt;/h3&gt; &lt;p&gt;For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the &lt;code&gt;FROM &amp;lt;&lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x1905bfd0&quot;&gt;http&lt;/a&gt;://dbpedia.org&amp;gt;&lt;/code&gt; from each query. The returned row counts are indicated below each query&amp;#39;s text.&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;&lt;pre&gt; sparql SELECT ?p ?o FROM &amp;lt;http://dbpedia.org&amp;gt; WHERE { &amp;lt;http://dbpedia.org/resource/Metropolitan_Museum_of_Art&amp;gt; ?p ?o }; -- 1337 rows sparql PREFIX p: &amp;lt;http://dbpedia.org/property/&amp;gt; SELECT ?film1 ?actor1 ?film2 ?actor2 FROM &amp;lt;http://dbpedia.org&amp;gt; WHERE { ?film1 p:starring &amp;lt;http://dbpedia.org/resource/Kevin_Bacon&amp;gt; . ?film1 p:starring ?actor1 . ?film2 p:starring ?actor1 . ?film2 p:starring ?actor2 . }; -- 23910 rows sparql PREFIX p: &amp;lt;http://dbpedia.org/property/&amp;gt; SELECT ?artist ?artwork ?museum ?director FROM &amp;lt;http://dbpedia.org&amp;gt; WHERE { ?artwork p:artist ?artist . 
?artwork p:museum ?museum . ?museum p:director ?director }; -- 303 rows sparql PREFIX geo: &amp;lt;http://www.w3.org/2003/01/geo/wgs84_pos#&amp;gt; PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; SELECT ?s ?homepage FROM &amp;lt;http://dbpedia.org&amp;gt; WHERE { &amp;lt;http://dbpedia.org/resource/Berlin&amp;gt; geo:lat ?berlinLat . &amp;lt;http://dbpedia.org/resource/Berlin&amp;gt; geo:long ?berlinLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s foaf:homepage ?homepage . FILTER ( ?lat &amp;lt;= ?berlinLat + 0.03190235436 &amp;amp;&amp;amp; ?long &amp;gt;= ?berlinLong - 0.08679199218 &amp;amp;&amp;amp; ?lat &amp;gt;= ?berlinLat - 0.03190235436 &amp;amp;&amp;amp; ?long &amp;lt;= ?berlinLong + 0.08679199218) }; -- 56 rows sparql PREFIX geo: &amp;lt;http://www.w3.org/2003/01/geo/wgs84_pos#&amp;gt; PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; PREFIX xsd: &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt; PREFIX p: &amp;lt;http://dbpedia.org/property/&amp;gt; SELECT ?s ?a ?homepage FROM &amp;lt;http://dbpedia.org&amp;gt; WHERE { &amp;lt;http://dbpedia.org/resource/New_York_City&amp;gt; geo:lat ?nyLat . &amp;lt;http://dbpedia.org/resource/New_York_City&amp;gt; geo:long ?nyLong . ?s geo:lat ?lat . ?s geo:long ?long . ?s p:architect ?a . ?a foaf:homepage ?homepage . FILTER ( ?lat &amp;lt;= ?nyLat + 0.3190235436 &amp;amp;&amp;amp; ?long &amp;gt;= ?nyLong - 0.8679199218 &amp;amp;&amp;amp; ?lat &amp;gt;= ?nyLat - 0.3190235436 &amp;amp;&amp;amp; ?long &amp;lt;= ?nyLong + 0.8679199218) }; -- 13 rows &lt;/pre&gt; &lt;/code&gt; &lt;/blockquote&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso 5.0.6 Updates</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-03-25#1326</atom:id>
  <atom:published>2008-03-25T16:59:08Z</atom:published>
  <atom:updated>2008-03-26T11:59:02-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;I will here summarize the developments since the last &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id15843368&quot;&gt;Virtuoso&lt;/a&gt; 5 Open Source release.&lt;/p&gt; &lt;p&gt;On the &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id101cae58&quot;&gt;RDF&lt;/a&gt; side, the bitmap intersection join has been improved quite a bit so that it is now almost always more than 2x more efficient than the equivalent nested loop join.&lt;/p&gt; &lt;p&gt;XML trees in the object position in &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id1172c108&quot;&gt;RDF&lt;/a&gt; quads were in some cases incorrectly indexed, leading to failure to retrieve quads.  This is fixed and should problems occur in existing databases, they can be corrected by simply dropping and re-creating an index.&lt;/p&gt; &lt;p&gt;Also the cost model has been further tuned.  We have run the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-idd65a998&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id11a2bf48&quot;&gt;H&lt;/a&gt; queries with larger databases and have profiled it extensively.  There are improvements to locking, especially for concurrency of transactions with large shared lock sets, as is the case in the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id12cfd690&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id15891ae0&quot;&gt;H&lt;/a&gt; queries.  The rules stipulate that these have to be run with repeatable read.  
There are also optimizations for decimal floating point.&lt;/p&gt; &lt;p&gt;A sampling of &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id15b12eb0&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id1172c740&quot;&gt;H&lt;/a&gt; queries translated into &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id15533aa8&quot;&gt;SPARQL&lt;/a&gt; comes with the new demo database.  These show a live sample of the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id15b82cd8&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id15521d50&quot;&gt;H&lt;/a&gt; schema translated into linked data, complete with &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id15ae14d8&quot;&gt;SPARQL&lt;/a&gt; translations of the original queries.  Some work is still ongoing there but the relational to &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id101b1240&quot;&gt;RDF&lt;/a&gt; mapping is mature enough for real business intelligence applications now.&lt;/p&gt; &lt;p&gt;On the closed source side, we have some adjustments to the virtual database.  When using &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id10888df0&quot;&gt;Virtuoso&lt;/a&gt; as a front end to Oracle, using the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id134a9378&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id10366320&quot;&gt;H&lt;/a&gt; queries as a metric, the virtual database overhead is minimal.  
Previously, we had some overhead because some queries were rewritten in a way that Oracle would not optimize as well as the original &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id15536320&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-idfd4e278&quot;&gt;H&lt;/a&gt; text.  Specifically, turning an IN sub-query predicate into an equivalent EXISTS did not sit well with Oracle.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>TPC H as Linked Data (Updated 2)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-03-06#1321</atom:id>
  <atom:published>2008-03-06T16:22:03Z</atom:published>
  <atom:updated>2008-08-28T11:25:55-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We have a new demo online at &lt;a href=&quot;http://demo.openlinksw.com/tpc-h&quot; id=&quot;link-id1829c9a0&quot;&gt;http://demo.openlinksw.com/tpc-h&lt;/a&gt;. This takes the industry standard &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0xeb7e460&quot;&gt;TPC-H&lt;/a&gt; benchmark &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xb40fcb8&quot;&gt;data&lt;/a&gt; and presents it as &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x9edbd128&quot;&gt;linked data&lt;/a&gt; with a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xf566a50&quot;&gt;SPARQL&lt;/a&gt; end point and dereferenceable URIs. &lt;/p&gt; &lt;p&gt;This is an example of using &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x11e59f80&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s relational-to-&lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xfc93c70&quot;&gt;RDF&lt;/a&gt; mapping for publishing business data, for browsing using the linked data principles and opening it to analytics queries in SPARQL.&lt;/p&gt; &lt;p&gt; As noted before, we have extended SPARQL with aggregation and nested queries, thus making it a viable &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xffe4520&quot;&gt;SQL&lt;/a&gt; substitute for decision support queries. &lt;/p&gt; &lt;p&gt;The article at &lt;a href=&quot;http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData&quot; id=&quot;link-id10799d10&quot;&gt;http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSTPCHLinkedData&lt;/a&gt; gives details and the source code for the implementation.&lt;/p&gt; &lt;p&gt; We are still working on some aspects of the more complex TPC-H queries, thus the demo is not complete with all the 22 queries. 
This is however enough to see a representative sample of how analytics queries work with SPARQL and Virtuoso&amp;#39;s SQL-to-RDF mapping. The demo will be part of the next Virtuoso Open Source download, probably out next week.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>What&#39;s Wrong With LUBM?</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-02-05#1312</atom:id>
  <atom:published>2008-02-05T11:47:11Z</atom:published>
  <atom:updated>2008-03-25T14:43:37-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;In the interest of participating in a community benchmark development process, I will here outline some desiderata and explain how we could improve on LUBM. I will also touch on the message such an effort ought to convey.&lt;/p&gt; &lt;p&gt;A blow-by-blow analysis of the performance of a complex system such as a DBMS is more than fits within the scope of human attention at one go. This is why this all must be abbreviated into a single metric. Only when thus abbreviated, can this information be used in context. The metric&amp;#39;s practical value is relative to how well it predicts the performance of the system in some real task. This means a task not likely to be addressed by an alternative technology, unless the challenger clearly beats the incumbent.&lt;/p&gt; &lt;p&gt;A benchmark is promotional material, both for the product being benchmarked and for the technology as a whole. This is why the benchmark, whatever it does, should do something that the technology does well, surely better than any alternative technology. A case in point is that one ought not to take a pure relational workload and &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x189e2b18&quot;&gt;RDF&lt;/a&gt;-ize it, for then the relational variant is likely to come out on top.&lt;/p&gt; &lt;p&gt;In this regard LUBM is not so bad because its reliance on class and property hierarchies and the occasional transitivity or inference rule makes the workload typically &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x17fe8588&quot;&gt;RDF&lt;/a&gt;, a little ways apart from a purely relational implementation of the task.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x1802e258&quot;&gt;RDF&lt;/a&gt;&amp;#39;s claim to fame is linked data. This means giving things globally unique names and thereby making anything joinable with anything else, insofar as there is agreement on the names. 
&lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0xa04d16c0&quot;&gt;RDF&lt;/a&gt; is a key to a new class of problems, call it web scale database. Web scale here refers first to heterogeneity and multiplicity of independent sources and secondly to volume of data.&lt;/p&gt; &lt;p&gt;Now there are plenty of relational applications with very large volumes of data. On the non-relational side, there are even larger applications, such as web search engines. All these have a set schema and a specific workload they are meant to address. &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x1971d010&quot;&gt;RDF&lt;/a&gt; versions of such are conceivable but hold no intrinsic advantage if considered in the specific niche alone.&lt;/p&gt; &lt;p&gt;The claim to fame of &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x18d9ace0&quot;&gt;RDF&lt;/a&gt; is not to outperform these on their home turf but to open another turf altogether, allowing agile joining and composing of all these resources.&lt;/p&gt; &lt;p&gt;This is why a benchmark, i.e., an advertisement for the &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x1be2af60&quot;&gt;RDF&lt;/a&gt; value proposition, should not just take a relational workload and &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x18041730&quot;&gt;RDF&lt;/a&gt;-ize it. The benchmark should carry some of the web in it.&lt;/p&gt; &lt;p&gt;If we just intend to measure how well an &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x1cb49658&quot;&gt;RDF&lt;/a&gt; store joins triples to other triples, LUBM is almost good enough. If it defined a query mix with different frequencies for short and long queries and a concurrent query metric, it would be pretty much there. Our adaptation of it is adequate for counting joins per second. 
But joins per second is not a value proposition.&lt;/p&gt; &lt;p&gt;So we have two questions: &lt;/p&gt; &lt;ol&gt; &lt;li&gt; &lt;p&gt;If we just take the &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x19cec640&quot;&gt;RDF&lt;/a&gt; model and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1998cc70&quot;&gt;SPARQL&lt;/a&gt;, how do we make a benchmark that fills in what LUBM does not cover?&lt;/p&gt; &lt;/li&gt; &lt;li&gt;How do we make a benchmark that displays &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0xbd74610&quot;&gt;RDF&lt;/a&gt;&amp;#39;s strengths against a comparable relational solution? A priori, by going somewhere where SQL has trouble reaching.&lt;/li&gt; &lt;/ol&gt; &lt;p&gt;The answers to the first are not very complex:&lt;/p&gt; &lt;ul&gt; &lt;li&gt; &lt;p&gt;Add some optionals. Have different frequencies of occurrence for some properties.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt;Add different graphs. Make queries joining between graphs and drawing on different graphs. Querying against all graphs of the store is not a part of the language. Still this would be useful but leave it out for now.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt;Add some filters and arithmetic. Not much can be done there, though because expressions cannot be returned and there is no aggregation or grouping.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt;Split the workload into short and long queries. The short should be typical for online use and the long ones for analysis. Different execution frequencies for different queries is a must. Analysis is limited by lack of grouping, expressions or aggregation. Still, something can be contrived by looking for a pattern that does not exist or occurs extremely rarely. Producing result sets of millions of rows is not realistic.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt;Many of the LUBM queries return thousands of rows, even when scoped to a single university. 
This is not very realistic. No user interface displays that sort of quantity. Of course, the intermediate results can be large as you please but the output must be somehow ranked. &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xd623d40&quot;&gt;SPARQL&lt;/a&gt; has order by and limit, so these will have to be used. &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x18032180&quot;&gt;TPC&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1a130188&quot;&gt;H&lt;/a&gt; for example has almost always a group by/order by combination and sometimes a result rows limit.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt;The degree of inference in LUBM is about right, mostly sub-classes and sub-properties, nothing complex. We certainly regard this as a database benchmark more than a knowledge representation or rule system one. &lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt;LUBM does an OK job of defining a scale factor. I think that a concurrent query metric can just be so many queries per time at a given scale. The number of clients, I would say, can be decided by the test sponsor, taking whatever works best. A load balancer or web server can always be tuned to enforce some limit on concurrency. I don&amp;#39;t think that a scale rule like in TPC C, where it says that only so many transactions per minute are allowed per warehouse is needed here. The effect of this is that when reporting a higher throughput, one has to automatically have a bigger database.&lt;/p&gt; &lt;/li&gt; &lt;/ul&gt; &lt;p&gt;There is nothing to prevent these improvements from being put into a subsequent version of LUBM.&lt;/p&gt; &lt;p&gt;Building something that shows &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x18034830&quot;&gt;RDF&lt;/a&gt; at its best is a slightly different proposition. 
For this, we cannot be limited to the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x18d990a0&quot;&gt;SPARQL&lt;/a&gt; recommendation and must allow custom application code and language extensions. Examples would be scripting similar to SQL stored procedures and extensions such as we have made for sub-queries and aggregation, explained a couple of posts back. &lt;/p&gt; &lt;p&gt;Maybe the Billion Triples challenge produces some material that we can use for this. We need to go for spaces that are not easily reached with SQL, have distributed computing, federation, discovery, demand driven import of data and such like. &lt;/p&gt; &lt;p&gt;I&amp;#39;ll write more about ways of making &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x19c403c8&quot;&gt;RDF&lt;/a&gt; shine in some future post. &lt;/p&gt; &lt;p&gt;There are two kinds of workloads: online and offline. Online is what must be performed in an interactive situation, without significant human perceptible delay, i.e. within 500 ms. Anything else is offline. &lt;/p&gt; &lt;p&gt;Because this is how any online system is designed, this should be reflected in the benchmark. Ideally we would make two benchmarks.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>LUBM results with Virtuoso 6.0</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-02-04#1308</atom:id>
  <atom:published>2008-02-04T09:58:03Z</atom:published>
  <atom:updated>2008-08-28T12:06:04-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We have now run the LUBM benchmark on &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1a6cb3c8&quot;&gt;Virtuoso&lt;/a&gt; v6, with the same configuration &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1302&quot; id=&quot;link-id107f0238&quot;&gt;as discussed last Friday&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;We had a database of 8000 universities, and we ran 8 clients on slices of 100, 1000 and 8000 universities: the same &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x12ac6cc8&quot;&gt;data&lt;/a&gt; but different sizes of working set.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;
 100 universities: 35.3 qps
1000 universities: 26.3 qps
8000 universities: 13.1 qps&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;The 100 universities slice is about the same as with v5.0.5 (35.3 vs 33.1 qps). &lt;br /&gt;The 8000 universities set is almost 3x better (13.1 vs. 4.8 qps).&lt;/p&gt; &lt;p&gt;This comes from the fact that the v6 database takes half of the space of the v5.0.5 one.  Further, this is with 64-bit IDs for everything.  If the 5.5 database were with 64-bit IDs, we&amp;#39;d have a difference of over 3x.  This is worth something if it lets you get by with only 1 terabyte of RAM for the 100-billion-triple application, instead of 3 TB.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1358&quot; id=&quot;link-id15fb4d38&quot;&gt;In a few more days&lt;/a&gt;, we&amp;#39;ll give the results for Virtuoso v6 Cluster.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Latest LUBM Benchmark results for Virtuoso</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1304</atom:id>
  <atom:published>2008-02-01T14:39:04Z</atom:published>
  <atom:updated>2008-08-28T12:06:01-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We have now taken a close look at the query side of the LUBM benchmark, &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1296&quot; id=&quot;link-id10a98120&quot;&gt;as promised a couple of blog posts ago.&lt;/a&gt; &lt;/p&gt; &lt;p&gt;We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.&lt;/p&gt; &lt;p&gt;When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130.&lt;/p&gt; &lt;p&gt;We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second.&lt;/p&gt; &lt;p&gt;The article &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VOSArticleLUBMBenchmark&quot; id=&quot;link-id10237708&quot;&gt;LUBM and Virtuoso&lt;/a&gt; gives the details.&lt;/p&gt; &lt;p&gt;In the process of going through the workload we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.&lt;/p&gt; &lt;p&gt;Now, what has been said about &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x104257c0&quot;&gt;RDF&lt;/a&gt; benchmarking previously still holds. 
What does it mean to do so many LUBM queries per second? What does this say about the capacity to run an online site off RDF &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x7376478&quot;&gt;data&lt;/a&gt;? Or about &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x13fd3f30&quot;&gt;information&lt;/a&gt; integration? Not very much. But then this was not the aim of the authors either.&lt;/p&gt; &lt;p&gt;So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.&lt;/p&gt; &lt;p&gt;In the immediate future, we have the general availability of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x193509e8&quot;&gt;Virtuoso&lt;/a&gt; Open Source 5.0.5 early next week. This comes with a LUBM test driver and a test suite running against the LUBM qualification database.&lt;/p&gt; &lt;p&gt;After this we will give some numbers for the cluster edition with LUBM and &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1b8d1348&quot;&gt;TPC-H&lt;/a&gt;.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>LUBM and Virtuoso 5.5</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-02-01#1302</atom:id>
  <atom:published>2008-02-01T12:37:53Z</atom:published>
  <atom:updated>2008-03-25T14:43:30.000004-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We have now taken a close look at the query side of the LUBM benchmark, as promised a couple of blog posts ago.&lt;/p&gt; &lt;p&gt;We load 8000 universities and run a query mix consisting of the 14 LUBM queries with different numbers of clients against different portions of the database.&lt;/p&gt; &lt;p&gt;When it is all in memory, we get 33 queries per second with 8 concurrent clients; when it is so I/O bound that 7.7 of 8 threads wait for disk, we get 5 qps. This was run in 8G RAM with 2 Xeon 5130.&lt;/p&gt; &lt;p&gt;We adapted some of the queries so that they do not run over the whole database. In terms of retrieving triples per second, this would be about 330000 for the rate of 33 qps, with 4 cores at 2GHz. This is a combination of random access and linear scans and bitmap merge intersections; lookups for non-found triples are not counted. The rate of random lookups alone based on known G, S, P, O, without any query logic, is about 250000 random lookups per core per second.&lt;/p&gt; &lt;p&gt;The article &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VOSArticleLUBMBenchmark&quot; id=&quot;link-id121f54d8&quot;&gt;LUBM and Virtuoso&lt;/a&gt; gives the details.&lt;/p&gt; &lt;p&gt;In the process of going through the workload, we made some cost model adjustments and optimized the bitmap intersection join. In this way we can quickly determine which subjects are, for example, professors holding a degree from a given university. So the benchmark served us well in that it provided an incentive to further optimize some things.&lt;/p&gt; &lt;p&gt;Now, what has been said about &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x186bcec0&quot;&gt;RDF&lt;/a&gt; benchmarking previously still holds. What does it mean to do so many LUBM queries per second? 
What does this say about the capacity to run an online site off &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0xa1e11918&quot;&gt;RDF&lt;/a&gt; data? Or about information integration? Not very much. But then this was not the aim of the authors either.&lt;/p&gt; &lt;p&gt;So we still need to make a benchmark for online queries and search, and another for E-science and business intelligence. But we are getting there.&lt;/p&gt; &lt;p&gt;In the immediate future, we have the general availability of &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id0x189b11b8&quot;&gt;Virtuoso&lt;/a&gt; Open Source 5.0.5 early next week. This comes with a LUBM test driver and a test suite running against the LUBM qualification database.&lt;/p&gt; &lt;p&gt;After this we will give some numbers for the cluster edition with LUBM and &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1b948dd8&quot;&gt;TPC&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x189e58a8&quot;&gt;H&lt;/a&gt;.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>SPARQL Extensions for Subqueries</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2008-01-16#1296</atom:id>
  <atom:published>2008-01-16T15:11:00Z</atom:published>
  <atom:updated>2008-03-25T14:43:28.000001-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;Last time I said we had extended &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1c668bb0&quot;&gt;SPARQL&lt;/a&gt; for sub-queries. As a preview of the new functionality, let us look at a query from &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0xd469258&quot;&gt;TPC&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1c68fe58&quot;&gt;H&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Below is the &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id0xcabe1f0&quot;&gt;Virtuoso&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1d3ba3d0&quot;&gt;SPARQL&lt;/a&gt; version of Q2.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;
&lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xd45b8c0&quot;&gt;sparql&lt;/a&gt;
define sql:signal-void-variables 1
prefix tpcd: &amp;lt;http://www.openlinksw.com/schemas/tpcd#&amp;gt;
prefix oplsioc: &amp;lt;http://www.openlinksw.com/schemas/oplsioc#&amp;gt;
prefix sioc: &amp;lt;http://rdfs.org/sioc/ns#&amp;gt;
prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
select
  ?supp+&amp;gt;tpcd:acctbal,
  ?supp+&amp;gt;tpcd:name,
  ?supp+&amp;gt;tpcd:has_nation+&amp;gt;tpcd:name as ?nation_name,
  ?part+&amp;gt;tpcd:partkey,
  ?part+&amp;gt;tpcd:mfgr,
  ?supp+&amp;gt;tpcd:address,
  ?supp+&amp;gt;tpcd:phone,
  ?supp+&amp;gt;tpcd:comment
from &amp;lt;http://example.com/tpcd&amp;gt;
where {
  ?ps a tpcd:partsupp ; tpcd:has_supplier ?supp ; tpcd:has_part ?part .
  ?supp+&amp;gt;tpcd:has_nation+&amp;gt;tpcd:has_region tpcd:name &amp;#39;EUROPE&amp;#39; .
  ?part tpcd:size 15 .
  ?ps tpcd:supplycost ?minsc .
  { select ?p min(?ps+&amp;gt;tpcd:supplycost) as ?minsc
    where {
      ?ps a tpcd:partsupp ; tpcd:has_part ?p ; tpcd:has_supplier ?ms .
      ?ms+&amp;gt;tpcd:has_nation+&amp;gt;tpcd:has_region tpcd:name &amp;#39;EUROPE&amp;#39; .
    } }
  filter (?part+&amp;gt;tpcd:type like &amp;#39;%BRASS&amp;#39;)
}
order by
  desc (?supp+&amp;gt;tpcd:acctbal)
  ?supp+&amp;gt;tpcd:has_nation+&amp;gt;tpcd:name
  ?supp+&amp;gt;tpcd:name
  ?part+&amp;gt;tpcd:partkey ;
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;Note the pattern &lt;code&gt;{ ?ms+&amp;gt;tpcd:has_nation+&amp;gt;tpcd:has_region tpcd:name &amp;#39;EUROPE&amp;#39; }&lt;/code&gt; which is a shorthand for &lt;code&gt;{ ?ms tpcd:has_nation ?t1 . ?t1 tpcd:has_region ?t2 . ?t2 tpcd:name &amp;quot;EUROPE&amp;quot; } &lt;/code&gt; &lt;/p&gt; &lt;p&gt;Also note a sub-query is used for determining the lowest supply cost for a part. &lt;/p&gt; &lt;p&gt;The SQL text of the query can be found in the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0xb945ed8&quot;&gt;TPC&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1c832038&quot;&gt;H&lt;/a&gt; benchmark specification, reproduced below: &lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;
select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey
  and s_suppkey = ps_suppkey
  and p_size = 15
  and p_type like &amp;#39;%BRASS&amp;#39;
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = &amp;#39;EUROPE&amp;#39;
  and ps_supplycost = (
    select min(ps_supplycost)
    from partsupp, supplier, nation, region
    where p_partkey = ps_partkey
      and s_suppkey = ps_suppkey
      and s_nationkey = n_nationkey
      and n_regionkey = r_regionkey
      and r_name = &amp;#39;EUROPE&amp;#39;)
order by s_acctbal desc, n_name, s_name, p_partkey;
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt; For brevity we have omitted the declarations for mapping the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1d365d90&quot;&gt;TPC&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x17b6f668&quot;&gt;H&lt;/a&gt; schema to its &lt;a 
href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x1898eaf0&quot;&gt;RDF&lt;/a&gt; equivalent. The mapping is straightforward, with each column mapping to a predicate and each table to a class.&lt;/p&gt; &lt;p&gt; This is now part of the next &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id0x1aaaa3f0&quot;&gt;Virtuoso&lt;/a&gt; Open Source cut, due around next week. &lt;/p&gt; &lt;p&gt; As of this writing we are going through the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1950f1c0&quot;&gt;TPC&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0xd0d8bc0&quot;&gt;H&lt;/a&gt; query by query and testing with mapping going to &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id0x18b88f08&quot;&gt;Virtuoso&lt;/a&gt; and Oracle databases. &lt;/p&gt; &lt;p&gt; Also we have been busy measuring &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id0x17b3fbb0&quot;&gt;Virtuoso&lt;/a&gt; 6. Even after switching from 32-bit to 64-bit IDs for IRIs and objects, the new databases are about half the size of the same &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id0xbe2ec80&quot;&gt;Virtuoso&lt;/a&gt; 5.0.2 databases. This does not include any stream compression like gzip for disk pages. The load and query speeds are higher because of a better working set. When everything is in memory, they are about even with 5.0.2. So now on an 8G box, we load 1067 million LUBM triples at 39.7 Kt/s instead of 29 Kt/s with 5.0.2. Right now we are experimenting with clusters at Amazon EC2. We&amp;#39;ll write about that in a bit. &lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Retrospective and Outlook for 2008</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2007-12-18#1286</atom:id>
  <atom:published>2007-12-18T10:53:40Z</atom:published>
  <atom:updated>2008-04-14T14:02:31.000004-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;At this close of the year, I&amp;#39;ll give a little recap of the past year in terms of &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot; id=&quot;link-idfe5ba58&quot;&gt;Virtuoso&lt;/a&gt; development, and take a look at where we are headed for 2008.&lt;/p&gt; &lt;p&gt;A year ago, I was in the middle of redoing the &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot;&gt;Virtuoso&lt;/a&gt; database engine for better SMP performance. We redid the way traversal of index structures and cache buffers was serialized for SMP, and generally compared Virtuoso and Oracle engines function by function. We had just returned from the ISWC 2006 in Athens, Georgia, and the Virtuoso database was becoming a usable triple store.&lt;/p&gt; &lt;p&gt;Soon thereafter, we confirmed that all this worked when we put out the first cut of &lt;a href=&quot;http://dbpedia.org/&quot; id=&quot;link-id149dcff8&quot;&gt;Dbpedia&lt;/a&gt; with Chris Bizer, et al, and were working with Alan Ruttenberg on what would become &lt;a href=&quot;http://esw.w3.org/topic/HCLS/Banff2007Demo&quot; id=&quot;link-id10c25b50&quot;&gt;the Banff health care and life sciences demo&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The &lt;a href=&quot;http://www2007.org/&quot; id=&quot;link-idfdbd0e0&quot;&gt;WWW 2007 conference in Banff&lt;/a&gt;, Canada, was a sort of kick-off for the &lt;a href=&quot;http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData&quot; id=&quot;link-id10e54940&quot;&gt;Linking Open Data&lt;/a&gt; movement, which started as a community project under &lt;a href=&quot;http://www.w3.org/2001/sw/sweo/&quot; id=&quot;link-idfd99988&quot;&gt;SWEO&lt;/a&gt;, the W3C interest group for Semantic Web Education and Outreach, and has gained a life of its own since.&lt;/p&gt; &lt;p&gt;Right after WWW 2007, the Virtuoso development effort split onto two tracks: one for enhancing the then new 5.0 release; and one for building a 
new generation of Virtuoso, notably featuring clustering and double storage density for &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The first track produced constant improvements to the relational-to-RDF mapping functionality, &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot;&gt;SPARQL&lt;/a&gt; enhancements, and Redland-, Jena- and Sesame-compatible client libraries with Virtuoso as a triple store. These things have been out with testers for a while and are all generally available as of this writing.&lt;/p&gt; &lt;p&gt;The second track started with adding key compression to the storage engine, specifically with regard to RDF, even though there are some gains in relational applications as well. With RDF, the space consumption drops to about half, all without recourse to compression that rules out random access, like gzip. At the start of August, we turned to clustering and are now code complete, with pretty much all the tricks one would expect: full-function SQL, taking advantage of co-located joins, and doing aggregation and generally all possible processing where the data is. I have covered details of this along the way in previous posts. The key point is that now the thing is written and works with test cases. &lt;/p&gt; &lt;p&gt;In late October, we were at the W3C workshop for mapping relational data to RDF. For us, this confirmed the importance of mapping and scalability in general. Ivan Herman proposed forming a W3C incubator group on benchmarking. Also a W3C incubator group on relational-to-RDF mapping is being formed. &lt;/p&gt; &lt;p&gt;Now, scalability has two sides. One is dealing with volume, and the other is dealing with complexity. Volume alone will not help if interesting queries cannot be formulated. Hence, we recently extended SPARQL with sub-queries so that we can now express at least any SQL workload, which was previously not the case. 
It is sort of a contradiction in terms to say that SPARQL is the universal language for information integration while not being able to express, for example, the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot;&gt;TPC-H&lt;/a&gt; queries. Well, we fixed this. A separate post will highlight how. The W3C process will eventually follow, as the necessity of these things is undeniable, on the unimpeachable authority of the whole SQL world. Anyway, for now, SPARQL as it is ought to become a recommendation and extensions can be addressed later.&lt;/p&gt; &lt;p&gt;For now, the only RDF benchmark that seems to be out there is the loading part of the LUBM. We did a couple of enhancements of our own for that just recently, but much bigger things are on the way. Also, the billion triples challenge is an interesting initiative in the area. We all recognize that loading any number of triples is a finite problem with known solutions. The challenge is running interesting queries on large volumes. &lt;/p&gt; &lt;p&gt;Our present emphasis is demonstrating both RDF data warehousing and RDF mapping with complex queries and large data. We start with the TPC-H benchmark and do the queries both through mapping to SQL against any RDBMS (Oracle, DB2, Virtuoso or other) and by querying the physical RDF rendition of the data in Virtuoso. From there, we move to querying a collection of RDBMS hosting similar data. &lt;/p&gt; &lt;p&gt;Doing this with performance at the level of direct SQL in the case of mapping and not very much slower with physical triples is an important milestone on the way to a real world enterprise data web. Real life has harder and more unexpected issues than a benchmark, but at any rate doing the benchmark without breaking a sweat is a step on the way. We sent a paper to ESWC 2008 about that but it was rather incomplete. 
By the time of the VLDB submissions deadline in March we&amp;#39;ll have more meat.&lt;/p&gt; &lt;p&gt;Another tack soon to start is a re-architecting of &lt;a href=&quot;http://zitgist.com/&quot; id=&quot;link-id10e8b9f0&quot;&gt;Zitgist&lt;/a&gt; around clustered Virtuoso. Aside from matters of scale, we will make a number of qualitatively new things possible. Again, more will be released in the first quarter of 2008.&lt;/p&gt; &lt;p&gt;Beyond these short and mid-term goals we have the introduction of entirely dynamic and demand driven partitioning, &lt;i&gt;a la&lt;/i&gt; Google Bigtable or Amazon Dynamo. Now, regular partitioning will do for a while yet but this is the future when we move to the vision of linked data everywhere.&lt;/p&gt; &lt;p&gt;In conclusion, this year we have built the basis and the next year is about deployment. The bulk of really new development is behind us and now we start applying. Also, the community will find adoption easier due to our recent support of the common RDF APIs. &lt;/p&gt; &lt;a href=&quot;index.vspx?tag=database&quot; rel=&quot;tag&quot; style=&quot;display:none;&quot;&gt;database&lt;/a&gt;&lt;a href=&quot;index.vspx?tag=databases&quot; rel=&quot;tag&quot; style=&quot;display:none;&quot;&gt;databases&lt;/a&gt;&lt;a href=&quot;index.vspx?tag=lubm&quot; rel=&quot;tag&quot; style=&quot;display:none;&quot;&gt;lubm&lt;/a&gt;&lt;a href=&quot;index.vspx?tag=benchmarking&quot; rel=&quot;tag&quot; style=&quot;display:none;&quot;&gt;benchmarking&lt;/a&gt;&lt;a href=&quot;index.vspx?tag=semanticweb&quot; rel=&quot;tag&quot; style=&quot;display:none;&quot;&gt;semanticweb&lt;/a&gt;&lt;a href=&quot;index.vspx?tag=sparql&quot; rel=&quot;tag&quot; style=&quot;display:none;&quot;&gt;sparql&lt;/a&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso LUBM Load Update</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2007-12-06#1284</atom:id>
  <atom:published>2007-12-06T13:35:21Z</atom:published>
  <atom:updated>2008-04-14T14:02:31-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;As part of the recent conversation on benchmarking &lt;a href=&quot;http://dbpedia.org/resource/RDF&quot; id=&quot;link-id0x17cd8cf8&quot;&gt;RDF&lt;/a&gt; stores, we re-ran the LUBM 8000 load test (1067 million triples) with the current &lt;a href=&quot;http://data.openlinksw.com/oplweb/product_family/virtuoso#this&quot; id=&quot;link-id0x17581ae0&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;We did it on two different machines, one with 2 Xeon 5130 2Ghz and 8G RAM and one with 2 Xeon 5330 2GHZ and 16G RAM. Both had 6 x 7800 rpm SATA-2 drives. The load rate on the 16G configuration was 36.8 Ktriples per second. The load rate on the 8G configuration was 29.7 Ktriples per second. Both loads were made using 6 concurrent load streams. Some small changes to the numbers may be released later as a result of changing tuning.&lt;/p&gt; &lt;p&gt;The Virtuoso version was 5.0, in the update to be released on the week of Dec 10, 2007. This is an incremental release of Virtuoso 5.0 and has the same engine as the prior 5.0s, with some optimizations for RDF loading and diverse bug fixes, notably in RDF mapping of relational data. This release will be further described in a separate post.&lt;/p&gt; &lt;p&gt;The load does not include forward chaining but then Virtuoso supports sub-class and sub-property without materializing the entailed triples.&lt;/p&gt; &lt;p&gt;Most of the LUBM entailed triples represent sub-classes and sub-properties. The LUBM query and forward chaining side deserves a separate treatment but this is for another time.&lt;/p&gt; &lt;p&gt;Most recent posts on this blog refer to Virtuoso 6, which is presently under development. We will publish results with the 6.0 engine later. Also, further enhancements to triple store performance will take place on the Virtuoso 6 platform.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Storage News</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2007-07-12#1225</atom:id>
  <atom:published>2007-07-12T14:16:40Z</atom:published>
  <atom:updated>2008-04-24T13:22:32-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;I have been away from the world for a few weeks, concentrating on technology.&lt;/p&gt; &lt;p&gt;We have now implemented an entirely new storage layout. With &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x9fd979d0&quot;&gt;RDF&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1ccee438&quot;&gt;data&lt;/a&gt;, we have now successfully doubled the working set.&lt;/p&gt; &lt;p&gt;This means that the number of triples that will fit in memory is doubled for any configuration. For any database in the hundreds of millions of triples, this is very significant. For LUBM data, we go from 75b to 35b per triple with the default indices.&lt;/p&gt; &lt;p&gt;This is obtained without using gzip or some other stream compression. Thus no decompression is needed at read time. Random access speeds are within 5% of those of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1e135ff0&quot;&gt;Virtuoso&lt;/a&gt; v5.0.1, but the space requirement is halved and you can still locate a random triple in cache in a few microseconds.&lt;/p&gt; &lt;p&gt;What is better still, when using 8-byte IDs for IRIs instead of 4-byte ones, the space consumption stays almost the same since unique values are stored only once per page.&lt;/p&gt; &lt;p&gt;When applying gzip to the new storage layout, we usually get 3x compression. This means that 99% of 8K pages fit in 3K after compression. This is no real surprise since an index is repetitive pretty much by definition, even if the repeated sections are now shorter than in v5.0.1.&lt;/p&gt; &lt;p&gt;Gzip applied to pages does nothing for the working set since a page must remain random accessible for fast search but will cut disk usage to between half and a third. We will make this an option later. 
There are other tricks to be done with compression, like using a separate dictionary for non key text columns in relational applications. This would improve the working set in &lt;a href=&quot;http://dbpedia.org/resource/TPC-C&quot; id=&quot;link-id0x9fbf44c0&quot;&gt;TPC-C&lt;/a&gt; and TPC-D quite a bit so we may do this also while on the subject.&lt;/p&gt; &lt;p&gt;Right now we are writing the clustering support, revising all internal APIs to run with batches of rows instead of single rows. We will most likely release clustering and the new storage layout together, towards the end of summer, at least in internal deployments.&lt;/p&gt; &lt;p&gt;I will &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0xba13470&quot;&gt;blog&lt;/a&gt; about results as and when they are obtained, over the next few weeks.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Comparison of Open Source Databases with TPC D Queries</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2007-02-05#1131</atom:id>
  <atom:published>2007-02-05T11:45:32Z</atom:published>
  <atom:updated>2008-04-17T21:04:29-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt; &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1116&quot; id=&quot;link-id10598cc0&quot;&gt;Last time&lt;/a&gt; we talked about database engine and transactions. Now we have come to the realm of query processing in our revisiting of the DBMS side of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1a4279e8&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Now the well established, respectable standard benchmark for the basics of query processing is TPC D with its derivatives H and R. So we have, for testing how different &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x17ce3a18&quot;&gt;SQL&lt;/a&gt; optimizers manage the 22 queries, run a mini version of the D queries with a 1% scale database, some 30M of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x17370eb0&quot;&gt;data&lt;/a&gt;, all in memory. This basically catches whether SQL implementations miss some of the expected tricks and how efficient in memory loop and hash joins and aggregation are.&lt;/p&gt; &lt;p&gt;When we get to our next stop, high volume I/O, we will run the same with D databases in the 10G ballpark.&lt;/p&gt; &lt;p&gt;The databases were tested on the same machine, with warm cache, taking the best run of 3. All had full statistics and were running with read committed isolation, where applicable. The data was generated using the procedures from the Virtuoso test suite. The Virtuoso version tested was 5.0, to be released shortly. The &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0xe435ad8&quot;&gt;MySQL&lt;/a&gt; was 5.0.27, the PostgreSQL was 8.1.6. 
&lt;/p&gt; &lt;table style=&quot;width: 334px; height: 556px; &quot; border=&quot;1&quot;&gt; &lt;tbody&gt; &lt;tr&gt; &lt;th rowspan=&quot;2&quot;&gt;Query&lt;/th&gt; &lt;th colspan=&quot;4&quot;&gt;Query Times in Milliseconds&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt; Virtuoso &lt;/th&gt; &lt;th&gt; PostgreSQL &lt;/th&gt; &lt;th&gt; MySQL &lt;/th&gt; &lt;th&gt; MySQL with InnoDB &lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q1 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;206&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 763 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 312 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 198 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q2 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 4 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 6 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;3&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;3&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q3 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;13&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 51 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 254 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 64 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q4 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;4&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 16 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 24 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 60 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q5 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;15&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 22 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 64 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 68 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q6 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;9&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 70 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 189 &lt;/td&gt; &lt;td 
align=&quot;right&quot;&gt; 65 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q7 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;52&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 143 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 211 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 84 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q8 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 29 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 31 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 13 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;11&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q9 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;36&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 114 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 97 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 61 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q10 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;32&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 51 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 117 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 57 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q11 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 16 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;9&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 12 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 10 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q12 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;8&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 21 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 18 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 130 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q13 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;18&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 74 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; - &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; - &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q14 &lt;/td&gt; &lt;td 
align=&quot;right&quot;&gt; &lt;b&gt;7&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 21 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 418 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 1425 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q15 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;14&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 43 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 389 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 122 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q16 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;16&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 22 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 18 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 25 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q17 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;1&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 54 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 26 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 10 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q18 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;82&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 120 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; - &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; - &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q19 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 19 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 8 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;2&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 17 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q20 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;7&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 15 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 66 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 52 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q21 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;34&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 
86 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 524 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 278 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt; Q22 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; &lt;b&gt;4&lt;/b&gt; &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 323 &lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 3311&lt;/td&gt; &lt;td align=&quot;right&quot;&gt; 805 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;Total (msec)&lt;/td&gt; &lt;td align=&quot;right&quot;&gt;&lt;b&gt;626&lt;/b&gt;&lt;/td&gt; &lt;td align=&quot;right&quot;&gt;2063&lt;/td&gt; &lt;td align=&quot;right&quot;&gt;6068&lt;/td&gt; &lt;td align=&quot;right&quot;&gt;3545&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt; &lt;/table&gt; &lt;p&gt;We lead by a fair margin, but MySQL is hampered by obviously getting some execution plans wrong and by not completing Q13 and Q18 at all, at least not in under several tens of seconds, so its times for these two queries are left out of the table. Note that the Virtuoso and PostgreSQL totals as printed include Q13 and Q18 (100 and 194 msec, respectively) while the MySQL totals do not; leaving these two queries out for comparability gives 526 msec for Virtuoso and 1869 msec for PostgreSQL.&lt;/p&gt; &lt;p&gt;As usual, we also ran the workload on &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0xe957b80&quot;&gt;Oracle&lt;/a&gt; 10g R2. Since Oracle does not like their numbers being published without explicit approval, we will just say that we are even with them within the parameters described above. Oracle has a more efficient decimal type so it wins where that is central, as on Q1. Also it seems to notice that the &lt;code&gt;GROUP BY&lt;/code&gt;s of Q18 are produced in order of grouping columns, so it needs no intermediate table for storing the aggregates. If we addressed these matters, we&amp;#39;d lead by some 15% whereas now we are even. 
A faster decimal arithmetic implementation may be in the release after next.&lt;/p&gt; &lt;p&gt;In the next posts, we will look at IO and disk allocation, and also return to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xe074e40&quot;&gt;RDF&lt;/a&gt; and LUBM.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso 5.0 Preview</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2007-01-10#1116</atom:id>
  <atom:published>2007-01-10T15:08:43Z</atom:published>
  <atom:updated>2008-04-17T21:04:24-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;As &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1108&quot; id=&quot;link-id10c66e68&quot;&gt;previously said&lt;/a&gt;, we have a &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1a5caeb8&quot;&gt;Virtuoso&lt;/a&gt; with brand new engine multithreading. It is now complete and passes its regular test suite. This is the basis for Virtuoso 5.0, to be available as the open source and commercial cuts as before.&lt;/p&gt; &lt;p&gt;As one benchmark, we used the &lt;a href=&quot;http://dbpedia.org/resource/TPC-C&quot; id=&quot;link-id0x15f8cbd8&quot;&gt;TPC-C&lt;/a&gt; test driver that has always been bundled with Virtuoso. We ran 100000 new orders worth of the TPC-C transaction mix first with one client and then with 4 clients, each client going to its own warehouse, so there was not much lock contention. We did this on a 4 core Intel, the working set in RAM. With the old one, 1 client took 1m43 and 4 clients took 3m47. With the new one, one client took 1m30 and 4 clients took 2m37. So, 400000 new orders in 2m37, for 152820 new orders per minute as opposed to 105720 per minute previously. Do not confuse with the official tpmC metric, that one involves a whole bunch of further rules.&lt;/p&gt; &lt;p&gt;TPC-C has activity spread over a few different tables. With tests dealing with fewer tables, improvements in parallelism are far greater.&lt;/p&gt; &lt;p&gt;Aside from better parallelism, we have other features. One of them is a change in the read committed isolation, so that we now return the previous committed state for uncommitted changed rows instead of waiting for the updating transaction to terminate. This is similar to what &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x18184c08&quot;&gt;Oracle&lt;/a&gt; does for read committed. 
Also we now do log checkpoints without having to abort pending write transactions.&lt;/p&gt; &lt;p&gt;When we have faster inserts, we actually see the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xde6fca0&quot;&gt;RDF&lt;/a&gt; bulk loader run slower. This is really backwards. The reason is that while one thread parses, other threads insert; when the inserting threads are done, they go to wait on a semaphore, and this whole business of context switching absolutely kills performance. With slower inserts, the parser keeps ahead so there is less context switching, hence better overall throughput. I still do not get how the OS can spend between 1.5 and 6 microseconds, several thousand instructions, deciding what to do next when there are only 3-4 eligible threads and all the rest is background which goes with a few dozen slices per second. Solaris is a little better than Linux at this but not dramatically so. Mac OS X is way worse.&lt;/p&gt; &lt;p&gt;As said, we use Oracle 10G2 on the same platform (Linux FC5 64 bit) for sparring. It is really a very good piece of software. We have written the TPC C transactions in PL/&lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x15b33600&quot;&gt;SQL&lt;/a&gt;. What is surprising is that these procedures run amazingly slowly, even with a single client. Otherwise the Oracle engine is very fast. Well, as I recall, the official TPC C runs with Oracle use an OCI client and no stored procedures. Strange. While Virtuoso for example fills the initial TPC C state a little faster than Oracle, the procedures run 5-10 times slower with Oracle than with Virtuoso, all &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xd9d1150&quot;&gt;data&lt;/a&gt; in warm cache and a single client. 
While some parts of Oracle are really well optimized, such as all basic joins and aggregates, we are surprised at how they could have neglected such a central piece as the PL.&lt;/p&gt; &lt;p&gt;Also, we have looked at transaction semantics. Serializable is mostly serializable with Oracle but does not always keep a steady count. Also it does not prevent inserts into a space that has been found empty by a serializable transaction. True, it will not show these inserts to the serializable transaction, so in this it follows the rules. Also, to make a read really repeatable, it seems that the read has to be FOR UPDATE. Otherwise one cannot implement a reliable resource transaction, like changing the balance of an account.&lt;/p&gt; &lt;p&gt;Anyway, the Virtuoso engine overhaul is now mostly complete. This is of course an open ended topic but the present batch is nearing completion. We have gone through as many as 3 implementations of hash joins, and some things have yet to be finished there. Oracle has very good hash joins. The only way we could match that was to do it all in memory, dropping any persistent storage of the hash. This is of course OK if the hash is not very large, and anyway hash joins go sour if the hash does not fit in the working set.&lt;/p&gt; &lt;p&gt;As next topics, we have more RDF and the LUBM benchmark to finish. Also we should revisit TPC-D.&lt;/p&gt; &lt;p&gt;Databases are really quite complicated and extensive pieces of software. Much more so than the casual observer might think.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>Ideas on RDF Store Benchmarking</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2006-11-21#1084</atom:id>
  <atom:published>2006-11-21T14:09:21Z</atom:published>
  <atom:updated>2008-04-16T16:53:22-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;This post presents some ideas and use cases for &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x18a21d88&quot;&gt;RDF&lt;/a&gt; store benchmarking.&lt;/p&gt; &lt;h4&gt;Use Cases&lt;/h4&gt; &lt;ul&gt; &lt;li&gt;Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.&lt;/li&gt; &lt;li&gt;Recursive rule application. The simpler cases of this are things like transitive closure.&lt;/li&gt; &lt;li&gt;Mapping of relational &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xdd0f4b0&quot;&gt;data&lt;/a&gt; to RDF. Since relational benchmarks are well established, as in the TPC benchmarks, the schemas and test data generation can come from there. The problem is that the D/H/R benchmarks consist of aggregates and grouping exclusively but &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xdd8b060&quot;&gt;SPARQL&lt;/a&gt; does not have these. &lt;/li&gt; &lt;/ul&gt; &lt;h4&gt;Benchmarking Triple Stores&lt;/h4&gt; &lt;p&gt;An RDF benchmark suite should meet the following criteria:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Have a single scale factor.&lt;/li&gt; &lt;li&gt;Produce a single metric, queries per unit of time, for example. The metric should be concisely expressible, for example 10 qpsR at 100M, options 1, 2, 3. 
Due to the heterogeneous nature of the systems under test, the result&amp;#39;s short form likely needs to specify the metric, scale and options included in the test.&lt;/li&gt; &lt;li&gt;Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.&lt;/li&gt; &lt;li&gt;Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.&lt;/li&gt; &lt;li&gt;Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. Typical degree of connectedness, number of triples per person, etc., can be measured from real files.&lt;/li&gt; &lt;li&gt;Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run and continuous query load.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;The query load should illustrate the following types of operations:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Basic lookups, such as would be made for filling in a person&amp;#39;s home page in a social networks app. List data of user plus names and emails of friends. Relatively short joins, unions, and optionals.&lt;/li&gt; &lt;li&gt;Graph operations like shortest path from individual to individual in a social network.&lt;/li&gt; &lt;li&gt;Selecting data with drill down, as in faceted browsing. 
For example, start with articles having &lt;a href=&quot;http://dbpedia.org/resource/Tag&quot; id=&quot;link-id0x1a11f478&quot;&gt;tag&lt;/a&gt; t, see distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2 and so forth.&lt;/li&gt; &lt;li&gt;Retrieving all closely related nodes, as in composing a SIOC snapshot over a person&amp;#39;s post in different communities, the recent activity report for a forum etc. These will be construct or describe queries. The coverage of describe is unclear, hence construct may be better.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to &lt;a href=&quot;http://dbpedia.org/resource/TPC-C&quot; id=&quot;link-id0x19ba4830&quot;&gt;TPC C&lt;/a&gt;. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.&lt;/p&gt; &lt;h4&gt;Full Disclosure Report&lt;/h4&gt; &lt;p&gt;The report contains basic TPC-like items such as:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Metric qps/scale/options&lt;/li&gt; &lt;li&gt;Software used, DBMS, RDF toolkit if separate&lt;/li&gt; &lt;li&gt;Hardware. Number, clock and type of CPUs per machine, number of machines in cluster, RAM per machine, disks per machine, manufacturer, price of hardware/software&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;These can go into a summary spreadsheet that is just like the TPC ones.&lt;/p&gt; &lt;p&gt;Additionally, the full report should include:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Configuration files for DBMS, web server, other components.&lt;/li&gt; &lt;li&gt;Parameters for test driver, i.e., number of clicks, how many concurrent clicks. 
The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as function of concurrent clients is a lot of work and maybe not necessary here.&lt;/li&gt; &lt;li&gt;Duration in real time. Since for any large database with a few G of working set the warm up time is easily 30 minutes, the warm up time should be mentioned but not included in the metric. The measured interval should not be less than 1h in duration and should reflect a &amp;quot;steady state,&amp;quot; as defined in the TPC rules.&lt;/li&gt; &lt;li&gt;Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages or any other server side software-like thing that exists or is modified for the purpose of the test.&lt;/li&gt; &lt;li&gt;Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.&lt;/li&gt; &lt;li&gt;Database sizes. For a preallocated database of n G, how much was free after the initial load, how much after the test run? How many bytes per triple.&lt;/li&gt; &lt;li&gt;CPU/IO. This may not always be readily measurable but is interesting still. Maybe a realistic spec is listing the sum of CPU minutes across all server machines and server processes. For IO, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them. &lt;/li&gt; &lt;/ul&gt; &lt;h4&gt;Test Drivers&lt;/h4&gt; &lt;p&gt;OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads. For example, 10000 users with 100 threads, each user with its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. 
The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart etc. This can be modified to browse a social network database and produce the desired query mix. This generates &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x19f1f538&quot;&gt;HTTP&lt;/a&gt; requests, hence would work against a SPARQL end point or any set of dynamic web pages.&lt;/p&gt; &lt;p&gt;The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.&lt;/p&gt; &lt;p&gt;This can be packaged as a separate open source download once the test spec is agreed upon.&lt;/p&gt; &lt;p&gt;For generating test data, a modification of the LUBM generator is probably the most convenient choice.&lt;/p&gt; &lt;h4&gt;Benchmarking Relational to RDF Mapping&lt;/h4&gt; &lt;p&gt;This area is somewhat more complex than triple storage.&lt;/p&gt; &lt;p&gt;At least the following factors enter into the evaluation:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Degree of SPARQL compliance. For example, can one have a variable as predicate? Are there limits on optionals and unions?&lt;/li&gt; &lt;li&gt;Are the data being queried split over multiple &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1aea8f70&quot;&gt;RDBMS&lt;/a&gt; and joined between them?&lt;/li&gt; &lt;li&gt;Type of use case. Is this about navigational lookups or about statistics? OLTP or OLAP? It would be the former, as SPARQL does not really have aggregation. Still, many of the interesting queries are about comparing large data sets.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;The rationale for mapping relational data to RDF is often data integration. 
Even in simple cases like the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0xd69ed58&quot;&gt;OpenLink Data Spaces&lt;/a&gt; applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.&lt;/p&gt; &lt;p&gt;A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.&lt;/p&gt; &lt;p&gt;A real world case is OpenLink&amp;#39;s ongoing work for mapping &lt;a href=&quot;http://dbpedia.org/resource/WordPress&quot; id=&quot;link-id0x19c5c5c0&quot;&gt;WordPress&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/MediaWiki&quot; id=&quot;link-id0x19a0c5d0&quot;&gt;Mediawiki&lt;/a&gt;, phpBB, &lt;a href=&quot;http://dbpedia.org/resource/Drupal&quot; id=&quot;link-id0xd106ef8&quot;&gt;Drupal&lt;/a&gt;, and possibly other popular web applications into SIOC.&lt;/p&gt; &lt;p&gt;Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.&lt;/p&gt; &lt;p&gt;Another &amp;quot;enterprise style&amp;quot; scenario might be to take the TPC C and TPC D databases (after all, both have products, customers and orders) and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.&lt;/p&gt; &lt;p&gt;Considering the times and the audience, the WordPress/Mediawiki scenario might be culturally more interesting and more fun to demo.&lt;/p&gt; &lt;p&gt;The test has two aspects: Throughput and coverage. 
I think these should be measured separately.&lt;/p&gt; &lt;p&gt;The throughput can be measured with queries that are generally sensible, such as &amp;quot;get articles by an author that I know with tags t1 and t2.&amp;quot;&lt;/p&gt; &lt;p&gt;Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known at run time only, or if the graph is not given, we get a union of everything joined with another union of everything. Many of the joins between the terms of the different unions are identically empty, but the software may not know this.&lt;/p&gt; &lt;p&gt;In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like &amp;quot;list all predicates and objects of everything called gizmo where the predicate is in the product ontology&amp;quot;.&lt;/p&gt; &lt;p&gt;It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>More RDF scalability tests</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2006-11-01#1074</atom:id>
  <atom:published>2006-11-01T19:26:40Z</atom:published>
  <atom:updated>2008-04-16T16:53:18-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We have lately been busy with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x17524ab8&quot;&gt;RDF&lt;/a&gt; scalability. We work with the 8000 university LUBM &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xd4ba910&quot;&gt;data&lt;/a&gt; set, a little over a billion triples. We can load it in 23h 46m on a box with 8G RAM. With 16G we probably could get it in 16h.&lt;/p&gt; &lt;p&gt;The resulting database is 75G, 74 bytes per triple which is not bad. It will shrink a little more if explicitly compacted by merging adjacent partly filled pages. See &lt;a href=&quot;http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSBitmapIndexing&quot; id=&quot;link-id105e5cf8&quot;&gt;Advances in Virtuoso RDF Triple Storage&lt;/a&gt; for an in-depth treatment of the subject.&lt;/p&gt; &lt;p&gt;The real question of RDF scalability is finding a way of having more than one CPU on the same index tree without them hitting the prohibitive penalty of waiting for a mutex. The sure solution is partitioning, which would probably have to be by range of the whole key. But before we go to so much trouble, we&amp;#39;ll look at dropping a couple of critical sections from index random access. Also some kernel parameters may be adjustable, like a spin count before calling the scheduler when trying to get an occupied mutex. Still we should not waste too much time on platform specifics. We&amp;#39;ll see.&lt;/p&gt; &lt;p&gt;We just updated the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x189d64b8&quot;&gt;Virtuoso&lt;/a&gt; Open Source cut. 
The latest RDF refinements are not in, so maybe the cut will have to be refreshed shortly.&lt;/p&gt; &lt;p&gt;We are also now applying the relational to RDF mapping discussed in &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VOSSQLRDF&quot; id=&quot;link-id10677bb8&quot;&gt;Declarative SQL Schema to RDF Ontology Mapping&lt;/a&gt; to the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0xa0f5fde0&quot;&gt;ODS&lt;/a&gt; applications.&lt;/p&gt; &lt;p&gt;There is a form of the mapping in the VOS cut on the net but it is not quite ready yet. We must first finish testing it through mapping all the relational schemas of the ODS apps before we can really recommend it. This is another reason for a VOS update in the near future.&lt;/p&gt; &lt;p&gt;We will be looking at the query side of LUBM after the ISWC 2006 conference. So far, we find queries compile OK for many SIOC use cases with the cost model that there is now. A more systematic review of the cost model for &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x19b96630&quot;&gt;SPARQL&lt;/a&gt; will come when we get to the queries.&lt;/p&gt; &lt;p&gt;We put some ideas about inferencing in the Advances in Triple Storage paper. The question is whether we should forward chain such things as class subsumption and subproperties. If we build these into the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x19bbd098&quot;&gt;SQL&lt;/a&gt; engine used for running SPARQL, we probably can do these as unions at run time with good performance and better working set due to not storing trivial entailed triples. Some more thought and experimentation needs to go into this.&lt;/p&gt;</atom:content>
 </atom:entry>
 <atom:entry>
  <atom:title>RDF Bulk Loading Revisited</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?date=2006-09-28#1058</atom:id>
  <atom:published>2006-09-28T11:39:52Z</atom:published>
  <atom:updated>2008-04-16T16:53:17.000007-04:00</atom:updated>
  <atom:content type="html">&lt;p&gt;We have made new benchmarks with loading the 47 million triples of the Wikipedia links &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1713ab70&quot;&gt;data&lt;/a&gt; set. So far, our best result is 40 minutes with a dual core Xeon with 8G memory. This comes to about 18000 triples per second with between 1.2 and 2 CPU cores busy, slightly depending on configuration parameters. Our previous best result was with a dual 1.6GHz SPARC with 7700 triples per second on loading the 2M triple Wordnet data set.&lt;/p&gt; &lt;p&gt;These are memory based speeds. We have implemented an automatic background compaction for database tables and have tried the Wikipedia load with and without. The CPU cost of the compaction was about 10% with a slight gain in real time due to less IO.&lt;/p&gt; &lt;p&gt;But the real deal remains IO. With the compaction on, we got 91 bytes per triple, all included, i.e., two indices on the triples table, dictionaries from IRI IDs to URIs, etc. The compaction is rather simple — it just detects adjacent dirty pages about to be written to disk and sees if the set of contiguous dirty pages would fit on fewer pages than they now take. If so, it rewrites the pages and frees the ones left over. It does not touch clean pages. With some more logic it could also compact clean pages, provided the result did not have more dirty pages than the initial situation. With more aggressive compaction we will get about 75 bytes per triple. We will try this.&lt;/p&gt; &lt;p&gt;But the real gains will come from index compression with bitmaps. For the Wikipedia data set, this will cut one of the indices to about a third of its current size. This is also the index with the more random access, so the benefit is compounded in terms of working set. At that point we will be looking at about 50 bytes per triple. 
We will see next week how this works with the LUBM &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x19af50b0&quot;&gt;RDF&lt;/a&gt; benchmark.&lt;/p&gt;</atom:content>
 </atom:entry>
</atom:feed>