<?xml version="1.0" encoding="UTF-8" ?>
<!--ATOM based XML document generated By OpenLink Virtuoso-->
<atom:feed xmlns:atom="http://www.w3.org/2005/Atom" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:vi="http://www.openlinksw.com/weblog/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:itunes="http://www.itunes.com/DTDs/Podcast-1.0.dtd" xmlns:dc="http://purl.org/dc/elements/1.1/">
<atom:id>http://www.openlinksw.com/weblog/oerling/</atom:id>
<atom:title>Orri Erling&#39;s Weblog</atom:title>
<atom:link href="http://www.openlinksw.com/weblog/oerling/" type="text/html" rel="alternate" />
<atom:link href="http://www.openlinksw.com/GData/oerling-blog-0" type="application/atom+xml" rel="self" />
 <atom:author>
  <atom:name>Orri Erling</atom:name>
  <atom:email>oerling@openlinksw.com</atom:email>
  </atom:author>
<atom:updated>2009-11-07T21:46:51Z</atom:updated>
<atom:generator>Virtuoso Universal Server 05.12.3041</atom:generator>
<atom:logo>http://www.openlinksw.com/weblog/public/images/vbloglogo.gif</atom:logo>
 <atom:entry>
  <atom:title>European Commission and the Data Overflow</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1585</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1585" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1585/1" rel="edit" />
  <atom:published>2009-10-27T18:29:51Z</atom:published>
  <atom:content type="html">&lt;p&gt;The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x79cfe58&quot;&gt;data&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Since the &lt;a href=&quot;http://cordis.europa.eu/fp7/ict/content-knowledge/consultation_en.html&quot; id=&quot;link-id1191c0f8&quot;&gt;questionnaire is public&lt;/a&gt;, I am publishing my answers below.&lt;/p&gt; &lt;ol type=&quot;1&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Data and data types&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? &lt;/b&gt; &lt;/p&gt; &lt;p&gt;Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x7d7e7a0&quot;&gt;RDF&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x7f2a788&quot;&gt;linked data&lt;/a&gt; principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x7dfbca8&quot;&gt;DBpedia&lt;/a&gt; identifiers for real-world entities, e.g., most things that would be in the news.&lt;/p&gt; &lt;p&gt;This also means that internal data processes and silos may be enriched with this content. 
There is consequent pressure for accommodating more diversity of data, with more flexible &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x7babaf8&quot;&gt;schema&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.&lt;/p&gt; &lt;p&gt;The volume of the well-known linked data sets is around 10 billion statements. With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is especially so if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.&lt;/p&gt; &lt;p&gt;Relevant sections of this mass of data are a potential addition to any present or future analytics application.&lt;/p&gt; &lt;p&gt;Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the coming years even more than the continued classical warehouse growth.&lt;/p&gt; &lt;p&gt;Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. 
Data and &lt;a href=&quot;http://dbpedia.org/resource/Metadata&quot; id=&quot;link-id0x96ce60&quot;&gt;metadata&lt;/a&gt; should travel together but may have different data models.&lt;/p&gt; &lt;p&gt;By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x7e8e248&quot;&gt;knowledge&lt;/a&gt; graphs, 3D, sensor streams...)?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., &lt;i&gt;photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.&lt;/i&gt; &lt;/p&gt; &lt;p&gt;Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.&lt;/p&gt; &lt;p&gt;Interleaving of all database functions and types becomes increasingly important.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Industries, communities&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Who is producing these data and why? Could they do it better? How?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Right now, projects such as &lt;a href=&quot;http://www.bio2rdf.org/&quot; id=&quot;link-id0x43bd098&quot;&gt;Bio2RDF&lt;/a&gt;, &lt;a href=&quot;http://neurocommons.org/page/Main_Page&quot; id=&quot;link-id0x5c074b0&quot;&gt;Neurocommons&lt;/a&gt;, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. 
These processes, along with the &lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot; id=&quot;link-id0x72131d0&quot;&gt;linked data meme&lt;/a&gt; generally taking off, drive demand for better &lt;a href=&quot;http://dbpedia.org/resource/Natural_language_processing&quot; id=&quot;link-id0x71e7798&quot;&gt;NLP&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/Natural_language_processing&quot; id=&quot;link-id0x7e0e2f0&quot;&gt;Natural Language Processing&lt;/a&gt;), e.g., &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x71ab500&quot;&gt;entity&lt;/a&gt; and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).&lt;/p&gt; &lt;p&gt;Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x7a8e9a8&quot;&gt;information&lt;/a&gt; integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x599f630&quot;&gt;URI&lt;/a&gt;. Some of this information may even be published on an &lt;a href=&quot;http://dbpedia.org/resource/Extranet&quot; id=&quot;link-id0x2a28f98&quot;&gt;extranet&lt;/a&gt; for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Who is consuming these data and why? 
Could they do it better? How?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird&amp;#39;s eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x5999558&quot;&gt;SQL&lt;/a&gt; report generators and statistics packages.&lt;/p&gt; &lt;p&gt;Where Web 2.0 made the &lt;i&gt;citizen journalist&lt;/i&gt;, the web of linked data will make the &lt;i&gt;citizen analyst&lt;/i&gt;. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a &amp;quot;meshup economy&amp;quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What industrial sectors in Europe could become more competitive if they became much better at managing data?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media. &lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Is the regulation landscape imposing constraints (privacy, compliance ...) that don&amp;#39;t have today good tool support?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;The regulation landscape drives database demand through data retention requirements and the like.&lt;/p&gt; &lt;p&gt;With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. 
Regulation is needed to protect individuals, but integration should still be possible for science.&lt;/p&gt; &lt;p&gt;For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.&lt;/p&gt; &lt;p&gt;More research is needed for reconciling the need for security with the advantages of broad-based &lt;i&gt;ad hoc&lt;/i&gt; integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&amp;#39;s profile. This is a tall order and implementing something of the sort is an open question.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.&lt;/b&gt; &lt;/p&gt; &lt;p&gt;We have come across the following:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Knowing that the data exists in the first place.&lt;/li&gt; &lt;li&gt;If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.&lt;/li&gt; &lt;li&gt;Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. 
Computation needs to go to the data, and databases need to support this.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Services, software stacks, protocols, standards, benchmarks&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What combinations of components are needed to deal with these problems?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Recent times have seen a proliferation of special-purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently separate functionality into an integrated system with sufficient flexibility. We see some of this in the integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, &lt;a href=&quot;http://dbpedia.org/resource/Greenplum&quot; id=&quot;link-id0x45ecfa0&quot;&gt;Greenplum&lt;/a&gt;, and OpenLink &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x7f73fc8&quot;&gt;Virtuoso&lt;/a&gt; are examples of DBMSs featuring work in this direction.&lt;/p&gt; &lt;p&gt;Interoperability and at least &lt;i&gt;de facto&lt;/i&gt; standards in ways of doing this will emerge.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What data exchange and processing mechanisms will be needed to work across platforms and programming languages?&lt;/b&gt; &lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x776a1a0&quot;&gt;HTTP&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x2a4e8d0&quot;&gt;XML&lt;/a&gt;, and RDF are in fact very verbose, yet these are the formats and models that have uptake. 
Thus, these will continue to be used even though one might think binary formats to be more efficient.&lt;/p&gt; &lt;p&gt;There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.&lt;/p&gt; &lt;p&gt;For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What data environments are today so wastefully messy that they would benefit from the development of standards?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;RDF and &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x2a35960&quot;&gt;OWL&lt;/a&gt; are not messy but they could use some more performance; we are working on this. &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x12362e8&quot;&gt;SPARQL&lt;/a&gt; is finally acquiring the capabilities of a serious query language, so things are slowly coming together.&lt;/p&gt; &lt;p&gt;The community process for developing application-domain-specific vocabularies works quite well, even though one could argue it is &lt;i&gt;ad hoc&lt;/i&gt; and not up to what a modeling purist might wish.&lt;/p&gt; &lt;p&gt;Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What kind of performance is expected or required of these systems? Who will measure it reliably? 
How?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;Relational databases have a history of substantial investment in &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0x7b2d7c8&quot;&gt;optimization&lt;/a&gt; and some of them are very good for what they do, e.g., the newer generation of analytics databases.&lt;/p&gt; &lt;p&gt;The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.&lt;/p&gt; &lt;p&gt;These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.&lt;/p&gt; &lt;p&gt;We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today&amp;#39;s best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with &lt;a href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x794ee48&quot;&gt;MonetDB&lt;/a&gt; gives clues as to how these aims can be reached.&lt;/p&gt; &lt;p&gt;The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.&lt;/p&gt; &lt;p&gt;Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like &lt;a href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x7d45c58&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x45b0198&quot;&gt;H&lt;/a&gt;. But the TPC-H rules do not allow official reporting of such.&lt;/p&gt; &lt;p&gt;Development of benchmarks for RDF, complex queries, and inference is needed. 
A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-&lt;a href=&quot;http://dbpedia.org/resource/C%2B%2B&quot; id=&quot;link-id0x7e32178&quot;&gt;C&lt;/a&gt; might be the basis, maybe augmented with massive user-generated content like reviews and blogs.&lt;/p&gt; &lt;p&gt;If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.&lt;/p&gt; &lt;p&gt;Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.&lt;/p&gt; &lt;p&gt;Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Usability and training&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. 
For the linked data &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x7e01618&quot;&gt;web&lt;/a&gt;, the same will take place behind SPARQL.&lt;/p&gt; &lt;p&gt;Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned.&lt;/p&gt; &lt;p&gt;There is no single solution. For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches.&lt;/p&gt; &lt;p&gt;Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.&lt;/p&gt; &lt;p&gt;For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.&lt;/p&gt; &lt;p&gt;For shipping functions in a cluster or cloud, the &lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x43665a8&quot;&gt;BOOM&lt;/a&gt; (&lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x7718f00&quot;&gt;Berkeley Orders Of Magnitude&lt;/a&gt;) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x7d64f68&quot;&gt;PHP&lt;/a&gt; developer can be made to do logic programming.&lt;/p&gt; &lt;p&gt;This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0xbab1ae98&quot;&gt;MySQL&lt;/a&gt;, inconvenient as this may be, for an example. 
There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, &lt;i&gt;post hoc&lt;/i&gt;, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.&lt;/p&gt; &lt;p&gt;One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.&lt;/p&gt; &lt;p&gt;This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. 
This inflection is provided by the &lt;a href=&quot;https://wiki.mozilla.org/Labs/Ubiquity&quot; id=&quot;link-id0x770df38&quot;&gt;ubiquity&lt;/a&gt; of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.&lt;/p&gt; &lt;p&gt;RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, &lt;a href=&quot;http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29&quot; id=&quot;link-id0x55596a8&quot;&gt;LAMP&lt;/a&gt; or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.&lt;/p&gt; &lt;p&gt;A lot of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x3d5378&quot;&gt;semantic web&lt;/a&gt; potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.&lt;/p&gt; &lt;p&gt;For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.&lt;/p&gt; &lt;p&gt;The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.&lt;/p&gt; &lt;p&gt;As the data overflow proceeds, its victims will multiply and create demand for solutions. 
The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.&lt;/p&gt; &lt;p&gt;If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.&lt;/p&gt; &lt;p&gt;The problem is creating the pull.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Challenges&lt;/b&gt; &lt;/p&gt; &lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, &lt;a href=&quot;http://dbpedia.org/resource/Google&quot; id=&quot;link-id0x6a6c2b0&quot;&gt;Google&lt;/a&gt; Lunar X Prize, etc. ... ?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. 
Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.&lt;/p&gt; &lt;p&gt;Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.&lt;/p&gt; &lt;p&gt;The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.&lt;/p&gt; &lt;p&gt;The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work more closely together. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.&lt;/p&gt; &lt;p&gt;If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;What should one do to set up such a challenge, administer, and monitor it?&lt;/b&gt; &lt;/p&gt; &lt;p&gt;The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.&lt;/p&gt; &lt;p&gt;The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&amp;amp;D. 
The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.&lt;/p&gt; &lt;p&gt;There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or in TREC, for example, has taken some years to reach a point of slowdown.&lt;/p&gt; &lt;p&gt;Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.&lt;/p&gt; &lt;/li&gt; &lt;/ol&gt; &lt;/li&gt; &lt;/ol&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="cluster" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="webservices" />
  <atom:category term="web2.0" />
  <atom:category term="web20" />
  <atom:category term="rdf" />
  <atom:category term="xml" />
  <atom:category term="mysql" />
  <atom:category term="semanticweb" />
  <atom:category term="web30" />
  <atom:category term="sparql" />
  <atom:category term="history" />
  <atom:category term="virtuoso" />
  <atom:category term="openlink" />
  <atom:updated>2009-10-27T14:57:28.000002-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>VLDB 2009 Web Scale Data Management Panel (5 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1582</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1582" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1582/3" rel="edit" />
  <atom:published>2009-09-01T16:24:17Z</atom:published>
  <atom:content type="html">&lt;blockquote&gt; &lt;p&gt; &lt;i&gt;&amp;quot;The universe of cycles is not exactly one of literal cycles, but rather one of spirals,&amp;quot; mused &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot; id=&quot;link-id117455a0&quot;&gt;Joe Hellerstein&lt;/a&gt; of UC Berkeley.&lt;/i&gt; &lt;/p&gt; &lt;p&gt; &lt;i&gt;&amp;quot;Come on, let&amp;#39;s all drop some &lt;a href=&quot;http://dbpedia.org/resource/ACID&quot; id=&quot;link-id16b3db50&quot;&gt;ACID&lt;/a&gt;,&amp;quot; interjected another.&lt;/i&gt; &lt;/p&gt; &lt;p&gt; &lt;i&gt;&amp;quot;It is not that we end up repeating the exact same things, rather even if some patterns seem to repeat, they do so at a higher level, enhanced by the experience gained,&amp;quot; continued Joe.&lt;/i&gt; &lt;/p&gt; &lt;/blockquote&gt; &lt;p&gt;Thus did the Web Scale &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id11061ae0&quot;&gt;Data&lt;/a&gt; Management panel conclude.&lt;/p&gt; &lt;p&gt;Whether successive generations are made wiser by the ones that have gone before may be argued either way.&lt;/p&gt; &lt;p&gt;The cycle in question was that of developers discovering ACID in the 1960s, i.e., Atomicity, Consistency, Isolation, Durability. Thus did the DBMS come into being. Then DBMSs kept becoming more complex until, as there will be a counter-force to each force, came the &lt;a href=&quot;http://dbpedia.org/resource/Meme&quot; id=&quot;link-id11076cc8&quot;&gt;meme&lt;/a&gt; of key value stores and BASE, no multiple-row transactions, eventual consistency, no query language but scaling to thousands of computers. 
So now, the DBMS community asks itself what went wrong.&lt;/p&gt; &lt;p&gt;In the words of one panelist, another demonstrated a &amp;quot;shocking familiarity with the subject matter of substance abuse&amp;quot; when he called for the DBMS community to get on a &lt;a href=&quot;http://dbpedia.org/resource/Twelve-step_program&quot; id=&quot;link-id15d954a8&quot;&gt;12 step program&lt;/a&gt; and to look where addiction to certain ideas, among which ACID, had brought its life. Look at yourself: The influential papers in what ought to be your space by rights are coming from the OS community: &lt;a href=&quot;http://dbpedia.org/resource/Google&quot; id=&quot;link-id166675f0&quot;&gt;Google&lt;/a&gt; Bigtable, Amazon Dynamo, want more? When you ought to drive, you give excuses and play catch up! Stop denial, drop &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1105adf0&quot;&gt;SQL&lt;/a&gt;, drop ACID!&lt;/p&gt; &lt;p&gt;The web developers have revolted against the time-honored principles of the DBMS. This is true. Sharded &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id1221c230&quot;&gt;MySQL&lt;/a&gt; is not the ticket — or is it? Must they rediscover the virtues of ACID, just like the previous generation did?&lt;/p&gt; &lt;p&gt;Nothing under the sun is new. As in music and fashion, trends keep cycling also in science and engineering.&lt;/p&gt; &lt;p&gt;But seriously, does the full-featured DBMS scale to web scale? &lt;a href=&quot;http://dbpedia.org/resource/Microsoft&quot; id=&quot;link-id10ffcaf8&quot;&gt;Microsoft&lt;/a&gt; says the Azure version of SQL server does. 
&lt;a href=&quot;http://dbpedia.org/resource/Yahoo%21&quot; id=&quot;link-id16b3f138&quot;&gt;Yahoo&lt;/a&gt; says they want no SQL but &lt;a href=&quot;http://dbpedia.org/resource/Hadoop&quot; id=&quot;link-id11046ef0&quot;&gt;Hadoop&lt;/a&gt; and &lt;a href=&quot;http://research.yahoo.com/node/2304&quot; id=&quot;link-id110a0040&quot;&gt;PNUTS&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Twitter, Facebook, and other web names got their own discussion. Why do they not go to serious DBMS vendors for their data but make their own, like Facebook with Hive?&lt;/p&gt; &lt;p&gt;Who can divine the mind of the web developer? What makes them go to &lt;a href=&quot;http://www.danga.com/memcached/&quot; id=&quot;link-id1109e280&quot;&gt;memcached&lt;/a&gt;, manually sharded MySQL, and &lt;a href=&quot;http://dbpedia.org/resource/MapReduce&quot; id=&quot;link-id1107cd60&quot;&gt;MapReduce&lt;/a&gt;, walking away from the 40 years of technology invested in declarative query and ACID? What is this highly visible but hard to grasp &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id1105b6b8&quot;&gt;entity&lt;/a&gt;? My guess is that they want something they can understand, at least at the beginning. A DBMS, especially on a cluster, is complicated, and it is not so easy to say how it works and how its performance is determined. The big brands, if deployed on a thousand PCs, would also be prohibitively expensive. But if all you do with the DBMS is single row selects and updates, it is no longer so scary, but you end up doing all the distributed things in a middle layer, and abandoning expressive queries, transactions, and database-supported transparency of location. But at least now you know how it works and what it is good/not good for.&lt;/p&gt; &lt;p&gt;This would be the case for those who make a conscious choice. 
But by and large the choice is not deliberate; it is something one drifts into: The application gains popularity; the single &lt;a href=&quot;http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29&quot; id=&quot;link-iddc68d28&quot;&gt;LAMP&lt;/a&gt; can no longer keep all in memory; you need a second MySQL in the LAMP and you decide that users A–M go left and N–Z right (horizontal partitioning). This siren of sharding beckons you and all is good until you hit the reef of re-architecting. Memcached and duct-tape help, like aspirin helps with hangover, but the root cause of the headache lies unaddressed.&lt;/p&gt; &lt;p&gt;The conclusion was that there ought to be something incrementally scalable from the get-go. Low cost of entry and built-in scale-out. No, the web developers do not hate SQL; they just have gotten the idea that it does not scale. But they would really wish it to. So, DBMS people, show there is life in you yet.&lt;/p&gt; &lt;p&gt;Joe Hellerstein was the philosopher and paradigmatician of the panel. His team had developed a protocol-compatible Hadoop in a few months using a declarative logic programming style approach. His claim was that developers made the market. Thus, for writing applications against web scale data, there would have to be data centric languages. Why not? These are discussed in &lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id110ba0e0&quot;&gt;Berkeley Orders Of Magnitude&lt;/a&gt; (&lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id16aab768&quot;&gt;BOOM&lt;/a&gt;).&lt;/p&gt; &lt;p&gt;I come from &lt;a href=&quot;http://en.wikipedia.org/wiki/Lisp_%28programming_language%29&quot; id=&quot;link-id10f2cd68&quot;&gt;Lisp&lt;/a&gt; myself, way back. I have since abandoned any desire to tell anybody what they ought to program in. 
This is a bit like religion: Attempting to impose or legislate or ram it on somebody just results in anything from lip service to rejection to war. The appeal exerted by the diverse language/paradigm -isms on their followers seems to be based on hitting a simplification of reality that coincides with a problem in the air. MapReduce is an example of this. &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-ide22cdd0&quot;&gt;PHP&lt;/a&gt; is another. A quick fix for a present need: Scripting web servers (PHP) or processing tons of files (MapReduce). The full database is not as quick a fix, even though it has many desirable features. It is also not as easy to tell what happens inside one, so MapReduce may give a greater feeling of control.&lt;/p&gt; &lt;p&gt;Totally self-managing, dynamically-scalable &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id152864b0&quot;&gt;RDF&lt;/a&gt; would be a fix for not having to design or administer databases: Since it would be indexed on everything, complex queries would be possible; no full database scans would stop everything. For the mid-size segment of web sites this might be a fit. For the extreme ends of the spectrum, the choice is likely something custom built and much less expressive.&lt;/p&gt; &lt;p&gt;The BOOM rule language for data-centric programming would be something very easy for us to implement, in fact we will get something of the sort essentially for free when we do the rule support already planned.&lt;/p&gt; &lt;p&gt;The question is, can one induce web developers to do logic? The history is one of procedures, both in LAMP and MapReduce. On the other hand, the query languages that were ever universally adopted were declarative, i.e., keyword search and SQL. There certainly is a quest for an application model for the cloud space beyond just migrating apps. We&amp;#39;ll see. More on this another time.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="cluster" />
  <atom:category term="rdf" />
  <atom:category term="sql_server" />
  <atom:category term="mysql" />
  <atom:category term="semanticweb" />
  <atom:category term="history" />
  <atom:updated>2009-09-02T12:05:20.000001-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>VLDB 2009 Yahoo Keynote (4 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1577</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1577" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1577/2" rel="edit" />
  <atom:published>2009-09-01T16:04:36Z</atom:published>
  <atom:content type="html">&lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/Raghu_Ramakrishnan&quot; id=&quot;link-id0x19076030&quot;&gt;Raghu Ramakrishnan&lt;/a&gt; of &lt;a href=&quot;http://dbpedia.org/resource/Yahoo%21&quot; id=&quot;link-id0x47142b8&quot;&gt;Yahoo&lt;/a&gt;! gave a keynote about &lt;a href=&quot;http://research.yahoo.com/node/2304&quot; id=&quot;link-id0x186c1288&quot;&gt;PNUTS&lt;/a&gt;, the Yahoo solution for managing massive user &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x4e966e0&quot;&gt;data&lt;/a&gt;, from front page preferences to mail to social networks.&lt;/p&gt; &lt;p&gt;Dynamic scale, wide area replication, and high availability are the issues. Transactions on multiple records, complex queries, and absolute consistency at all times are traded off. Also, the programming interfaces are lower level than with &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x23e68948&quot;&gt;SQL&lt;/a&gt;. Replication and consistency rules are choices for the application developer; the platform offers some basic alternatives. Implementation-wise, there is a &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0x182cead8&quot;&gt;MySQL&lt;/a&gt; back-end and all the partitioning, query routing, replication, and balancing take place in a layer of front-ends.&lt;/p&gt; &lt;p&gt;Now what do we say to this?&lt;/p&gt; &lt;p&gt;In the Yahoo! case, even if complex queries were possible, which they are not, one would probably keep them off the online system since latency and availability are everything. A latency of some tens of milliseconds is however acceptable, which is not so terrible for single record operations: There is time for a couple of messages on the data center network and even maybe for a disk read.&lt;/p&gt; &lt;p&gt;PNUTS is probably the fastest way of getting to the desired beachhead of simple access to data at infinite scale in multiple geographies. 
In the identical situation, I might have done something similar.&lt;/p&gt; &lt;p&gt;But we are in a different situation, concerned with complex queries, a highly-normalized &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x197b7948&quot;&gt;schema&lt;/a&gt;-last situation, i.e., index on everything, large objects normalized away, as is done in &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x385a900&quot;&gt;RDF&lt;/a&gt;. Then we are also in the relational situation. Infinite scale, fault tolerance, and wide-area replication do come up regularly in user needs. The applications for which people would like RDF are not only complex reasoning things but very big &lt;a href=&quot;http://dbpedia.org/resource/Metadata&quot; id=&quot;link-id0x25a30d98&quot;&gt;metadata&lt;/a&gt; stores for user generated content, social networks, and the like.&lt;/p&gt; &lt;p&gt;Which of the PNUTS principles could we apply?&lt;/p&gt; &lt;ul&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Division in tablets:&lt;/b&gt; When a partition of the data grows too big, it should split.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Migration of partitions:&lt;/b&gt; as capacity/demand change, partitions should migrate so as to equalize load.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;High availability:&lt;/b&gt; This is divided in two — on one hand inside the data center; on the other between data centers. Inside the data center, storing partitions in duplicate and running them synchronously is possible. This is manifestly impossible in wide area settings, though. For this, we need a log-shipping style of asynchronous replication. But how does one deal with split networks and transfer of replication mastery?&lt;/p&gt; &lt;/li&gt; &lt;/ul&gt; &lt;p&gt;PNUTS determines the master copy record by record. This makes sense when the record, for example, corresponds to a user. 
For RDF, doing this by the triple would be prohibitive. Doing this by the graph, or by the subject of a set of triples across all graphs, would be better. We would agree with PNUTS that transferring mastery by the storage chunk is not desired, as the chunk will contain arbitrary unrelated data.&lt;/p&gt; &lt;p&gt; &lt;/p&gt; &lt;p&gt;The eventual consistency mechanisms can be generalized to RDF readily enough. In a social RDF application, the graph is the most likely unit of data ownership and update authorization, so the graph would also be the unit of eventual consistency. Keeping a separate data structure listing recent inserts/deletes to a graph with timestamps would serve for establishing consistency. The size of this would be a small fraction of the size of the graph.&lt;/p&gt; &lt;p&gt;RDF cannot do anything without joining between partitions, whereas for PNUTS the join between partitions is an application matter. But then PNUTS does have an extra step of RPC between the PNUTS infrastructure and the back-end. Doing query routing in the back-end gets rid of this. RDF does remain more dependent on even performance and short interconnect latencies, though. It also likely takes more space. But the essential consistency and availability features can be generalized to it, providing the merge of semi-structured data at infinite scale and availability with complex query.&lt;/p&gt; &lt;p&gt;At any rate, repartitioning-on-demand and partition-migration remain the key agenda items for us, confirmed over and over at VLDB.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="rdf" />
  <atom:category term="mysql" />
  <atom:category term="semanticweb" />
  <atom:updated>2009-09-01T17:32:35.000002-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>VLDB 2009 TPC Workshop (3 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1576</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1576" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1576/2" rel="edit" />
  <atom:published>2009-09-01T15:51:09Z</atom:published>
  <atom:content type="html">&lt;p&gt;Michael &lt;a href=&quot;http://dbpedia.org/resource/Michael_Stonebraker&quot; id=&quot;link-id0x15e5efe0&quot;&gt;Stonebraker&lt;/a&gt; gave the keynote at the &lt;a href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x18cee5f0&quot;&gt;TPC&lt;/a&gt; workshop. His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself. From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention.&lt;/p&gt; &lt;p&gt;Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0x4e55368&quot;&gt;optimization&lt;/a&gt;, as has pretty much everybody else.&lt;/p&gt; &lt;p&gt;It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x191cd880&quot;&gt;data&lt;/a&gt;, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication.&lt;/p&gt; &lt;p&gt;Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when they can. They want arrays for physics, and graphs for biology and chemistry. 
&lt;a href=&quot;http://dbpedia.org/resource/MapReduce&quot; id=&quot;link-id0x53f6040&quot;&gt;MapReduce&lt;/a&gt; is eating database&amp;#39;s lunch; what will you do about this?&lt;/p&gt; &lt;p&gt;I later suggested incorporating an &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x18902070&quot;&gt;RDF&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Metadata&quot; id=&quot;link-id0x3990af8&quot;&gt;metadata&lt;/a&gt; benchmark into the TPC suite. We&amp;#39;ll see about this; we&amp;#39;ll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.&lt;/p&gt; &lt;p&gt;TPC&amp;#39;s own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. When the solution to this problem becomes commonplace, the benchmark&amp;#39;s relevance gradually drops.&lt;/p&gt; &lt;p&gt;There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones. Also contrasting of &lt;a href=&quot;http://dbpedia.org/resource/Cache&quot; id=&quot;link-id0x1942aca8&quot;&gt;cache&lt;/a&gt; fusion and partitioning. We have our own data and experience but we find we don&amp;#39;t have time to measure all the other systems.&lt;/p&gt; &lt;p&gt;Anyway it is good to raise the question of smooth and predictable performance.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="cluster" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:updated>2009-09-01T17:32:30-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Some Interesting VLDB 2009 Papers (2 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1575</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1575" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1575/2" rel="edit" />
  <atom:published>2009-09-01T15:46:14Z</atom:published>
  <atom:content type="html">&lt;h3&gt; &lt;a href=&quot;http://dbpedia.org/resource/Intel_Corporation&quot; id=&quot;link-id0x3588e30&quot;&gt;Intel&lt;/a&gt; on &lt;a href=&quot;http://dbpedia.org/resource/Hash_join&quot; id=&quot;link-id0x1bc77c90&quot;&gt;Hash Join&lt;/a&gt; &lt;/h3&gt; &lt;p&gt;Intel and &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x2f1d4d8&quot;&gt;Oracle&lt;/a&gt; had measured hash and sort merge joins on Intel Core i7. The result was that hash join with both tables partitioned to match &lt;a href=&quot;http://dbpedia.org/resource/Central_processing_unit&quot; id=&quot;link-id0x55b2b70&quot;&gt;CPU&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Cache&quot; id=&quot;link-id0x2a4fef8&quot;&gt;cache&lt;/a&gt; was still the best but that sort/merge would catch up with more &lt;a href=&quot;http://dbpedia.org/resource/SIMD&quot; id=&quot;link-id0x4fe8670&quot;&gt;SIMD&lt;/a&gt; instructions in the future.&lt;/p&gt; &lt;p&gt;We should probably experiment with this but the most important partitioning of hash joins is still between cluster nodes. Within the process, we will see. The tradeoff of doing all in cache-sized partitions is larger intermediate results which in turn will impact the working set of disk pages in RAM. For one-off queries this is OK; for online use this has an effect.&lt;/p&gt; &lt;h3&gt;1000 TABLE Queries&lt;/h3&gt; &lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/SAP_AG&quot; id=&quot;link-id0x55a1018&quot;&gt;SAP&lt;/a&gt; presented a paper about &lt;a href=&quot;http://dbpedia.org/resource/Federated_database_system&quot; id=&quot;link-id0x5500758&quot;&gt;federating relational databases&lt;/a&gt;. Queries would be expressed against VIEWs defined over remote TABLEs, UNIONed together and so forth. 
Traditional methods of &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0x4f038f0&quot;&gt;optimization&lt;/a&gt; would run out of memory; a single 1000 TABLE plan is already a big thing. Enumerating multiple variations of such is not possible in practice. So the solution was to plan in two stages — first arrange the subqueries and derived TABLEs, and then do the JOIN orders locally. Further, local JOIN orders could even be adjusted at run time based on the actual &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x41d3560&quot;&gt;data&lt;/a&gt;. Nice.&lt;/p&gt; &lt;h3&gt;Oracle Subqueries and New Implementation of LOBs&lt;/h3&gt; &lt;p&gt;Oracle presented some new &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x39ad838&quot;&gt;SQL&lt;/a&gt; optimizations, combining and inlining subqueries and derived TABLEs. We do fairly similar things and might extend the repertoire of tricks in the direction outlined by Oracle as and when the need presents itself. This further confirms that SQL and other query optimization is really an incremental collection of specially recognized patterns. We still have not found any other way of doing it.&lt;/p&gt; &lt;p&gt;Another interesting piece by Oracle was about their re-implementation of large object support, where they compared LOB loading to file system and raw device speeds.&lt;/p&gt; &lt;h3&gt; &lt;a href=&quot;http://dbpedia.org/resource/Amadeus_CRS&quot; id=&quot;link-id0x3aa1378&quot;&gt;Amadeus CRS&lt;/a&gt; booking system, steady query time for arbitrary single table queries&lt;/h3&gt; &lt;p&gt;There was a paper about a memory-resident database that could give steady time for any kind of single-table scan query. The innovation was to not use indices, but to have one partition of the table per processor core, all in memory. Then each core would have exactly two cursors — one reading, the other writing. 
The write cursor should keep ahead of the read cursor. Like this, there would be no read/write contention on pages, no locking, no multiple threads splitting a tree at different points, none of the complexity of a multithreaded database engine. Then, when the cursor would hit a row, it would look at the set of queries or updates and add the result to the output if there was a result. The data indexes the queries, not the other way around. We have done something similar for detecting changes in a full text corpus but never thought of doing queries this way.&lt;/p&gt; &lt;p&gt;Well, we are all about JOINs so this is not for us, but it deserves a mention for being original and clever. And indeed, anything one can ask about a table will likely be served with great predictability.&lt;/p&gt; &lt;h3&gt; &lt;a href=&quot;http://dbpedia.org/resource/Greenplum&quot; id=&quot;link-id0x3670360&quot;&gt;Greenplum&lt;/a&gt; &lt;/h3&gt; &lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/Google&quot; id=&quot;link-id0x2c5dfb8&quot;&gt;Google&lt;/a&gt;&amp;#39;s chief economist said that the winning career choice would be to pick a scarce skill that made value from something that was plentiful. For the 2010s this career is that of the statistician/data analyst. We&amp;#39;ve said it before — the next web is analytics for all. The Greenplum talk was divided between the Fox use case, with 200TB of data about ads, web site traffic, and other things, growing 5TB a day. The message was that cubes and drill down are passé, that it is about complex statistical methods that have to run in the database, that the new kind of geek is the data geek, whose vocation it is to consume and spit out data, discover things in it, and so forth.&lt;/p&gt; &lt;p&gt;The technical part was about Greenplum, a SQL database running on a cluster with a &lt;a href=&quot;http://dbpedia.org/resource/PostgreSQL&quot; id=&quot;link-id0x4e15798&quot;&gt;PostgreSQL&lt;/a&gt; back-end. 
The interesting points were embedding &lt;a href=&quot;http://dbpedia.org/resource/MapReduce&quot; id=&quot;link-id0x4fd3e00&quot;&gt;MapReduce&lt;/a&gt; into SQL, and using relational tables for arrays and complex data types — pretty much what we also do. Greenplum emphasized scale-out and found column orientation more like a nice-to-have.&lt;/p&gt; &lt;h3&gt; &lt;a href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x416d288&quot;&gt;MonetDB&lt;/a&gt;, optimizing database for CPU cache&lt;/h3&gt; &lt;p&gt;The MonetDB people from &lt;a href=&quot;http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science&quot; id=&quot;link-id0x4ebedb0&quot;&gt;CWI&lt;/a&gt; in Amsterdam gave a 10 year best paper award talk about optimizing database for CPU cache. The key point was that if data is stored as columns, it ought also to be transferred as columns inside the execution engine. Materialize big chunks of state to cut down on interpretation overhead and use cache to best effect. They vector for CPU cache; we vector for scale-out, since the only way to ship operations is to ship many at a time. So we might as well vector also in single servers. This could be worth an experiment. Also we regularly visit the topic of &lt;a href=&quot;http://dbpedia.org/resource/Column-oriented_DBMS&quot; id=&quot;link-id0x4d34cb0&quot;&gt;column storage&lt;/a&gt;. But we are not yet convinced that it would be better than row-style covering indices for &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x4d55d78&quot;&gt;RDF&lt;/a&gt; quads. But something could certainly be tried, given time.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="cluster" />
  <atom:category term="rdf" />
  <atom:category term="oracle" />
  <atom:category term="postgres" />
  <atom:category term="semanticweb" />
  <atom:updated>2009-09-01T17:32:24.000004-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>VLDB 2009 (1 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1574</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1574" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1574/3" rel="edit" />
  <atom:published>2009-09-01T15:30:37Z</atom:published>
  <atom:content type="html">&lt;p&gt;I was at the &lt;a href=&quot;http://vldb2009.org/&quot; id=&quot;link-id0x6700588&quot;&gt;VLDB 2009&lt;/a&gt; conference in Lyon, France. In the next few posts, I will discuss some of the prominent themes and how they relate to our products or to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x69386a8&quot;&gt;RDF&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x7537ce0&quot;&gt;Linked Data&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Firstly, RDF was as good as absent from the presentations and discussions we saw. There were a few mentions in the panel on structured &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1c201ed0&quot;&gt;data&lt;/a&gt; on the web; however, RDF was not in any way seen to be essential for this. There were also a couple of RDF mentions in questions at other sessions, but that was about it.&lt;/p&gt; &lt;p&gt;It is a common perception that RDF and database people do not talk with each other. Evidence seems to bear this out.&lt;/p&gt; &lt;p&gt;As a database developer I did get a lot of readily applicable ideas from the VLDB talks. 
These run across the whole range of DBMS topics, from &lt;a href=&quot;http://dbpedia.org/resource/Data_compression&quot; id=&quot;link-id0x1b802010&quot;&gt;key compression&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x48bc820&quot;&gt;SQL&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0x218bd558&quot;&gt;optimization&lt;/a&gt;, to &lt;a href=&quot;http://dbpedia.org/resource/Column-oriented_DBMS&quot; id=&quot;link-id0x238a39c8&quot;&gt;column storage&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Central_processing_unit&quot; id=&quot;link-id0x6694538&quot;&gt;CPU&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Cache&quot; id=&quot;link-id0x4895568&quot;&gt;cache&lt;/a&gt; optimization, and the like. In this sense, VLDB is directly relevant to all we do. In a conversation, someone was mildly confused that I should on one hand mention I was doing RDF, and on the other hand also be concerned about database performance. These things are not seen to belong together, even though making RDF do something useful certainly depends on a great deal of database optimization.&lt;/p&gt; &lt;p&gt;The question of all questions — that of infinite scale-out with complex queries, resilience, replication, and full database semantics — was strongly in the air.&lt;/p&gt; &lt;p&gt;But it was in the air more as a question than as an answer. Not very much at all was said about the performance of distributed query plans, of &lt;a href=&quot;http://dbpedia.org/resource/Two-phase_commit_protocol&quot; id=&quot;link-id0x7a4b208&quot;&gt;2pc&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/Two-phase_commit_protocol&quot; id=&quot;link-id0x1a0e8ac8&quot;&gt;two-phase commit&lt;/a&gt;), of the impact of interconnect latency, and such things. 
On the other hand, people were talking quite liberally about optimizing CPU cache and local multi-core execution, not to mention SQL plans and rewrites. Also, almost nothing was said about transactions.&lt;/p&gt; &lt;p&gt;Still, there is bound to be a great deal of work in scale-out of complex workloads by any number of players. Either these things are all figured out and considered self-evidently trivial, or they are so hot that people will go there only by way of allusion and vague reference. I think it is the latter.&lt;/p&gt; &lt;p&gt;By and large, we were confirmed in our understanding that infinite scale-out on the go, with redundancy, is the ticket, especially if one can offer complex queries and transactional semantics coupled with instant data loading and &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x23a0c590&quot;&gt;schema&lt;/a&gt;-last.&lt;/p&gt; &lt;p&gt;Column storage and cache optimizations seem to come right after these.&lt;/p&gt; &lt;p&gt;Certainly the database space is diversifying.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/MapReduce&quot; id=&quot;link-id0x185e6370&quot;&gt;MapReduce&lt;/a&gt; was discussed quite a bit, as an intruder into what would be the database turf. We have no great problem with MapReduce; we do that in SQL procedures if one likes to program in this way. &lt;a href=&quot;http://dbpedia.org/resource/Greenplum&quot; id=&quot;link-id0x1ad96d68&quot;&gt;Greenplum&lt;/a&gt; also seems to have come by the same idea.&lt;/p&gt; &lt;p&gt;As said before, RDF and RDF reasoning were ignored. Do these actually offer something to the database side? Certainly for search, discovery, integration, and resource discovery, linked data has evident advantages.&lt;/p&gt; &lt;p&gt;Two points of the design space — the warehouse, and the web-scale key-value store — got a lot of attention. Would I do either in RDF? 
RDF is a slightly different design space point, like key-value with complex queries — on the surface, a fusion of the two. As opposed to RDF, the relational warehouse gains from fixed data-types and task-specific layout, whether row or column. The key-value store gains from having a concept of a semi-structured record, a bit like the RDF subject of a triple, but now with ad-hoc (if any) secondary indices, and inline blobs. The latter is much simpler and more compact than the generic RDF subject with graphs and all, and can be easily treated as a unit of version control and replication mastering. RDF, being more generic and more normalized, is representationally neither as ad-hoc nor as compact.&lt;/p&gt; &lt;p&gt;But RDF will be the natural choice when complex queries and ad-hoc schema meet, for example in web-wide integrations of application data.&lt;/p&gt; &lt;p&gt;There seems to be a huge divide in understanding between database-developing people and those who would be using databases. On one side, this has led to a back-to-basics movement with no SQL, no &lt;a href=&quot;http://dbpedia.org/resource/ACID&quot; id=&quot;link-id0x2ec4088&quot;&gt;ACID&lt;/a&gt;, key-value pairs instead of schema, MapReduce instead of fancy but hard-to-follow parallel execution plans. On the other side, the database space specializes more and more; it is no longer simply transactions vs. analytics, but many more points of specialization.&lt;/p&gt; &lt;p&gt;Some frustration can be sensed in the ivory towers of science when it is seen that the ones most in need of database understanding in fact have the least. 
&lt;a href=&quot;http://dbpedia.org/resource/Google&quot; id=&quot;link-id0x7748540&quot;&gt;Google&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Yahoo%21&quot; id=&quot;link-id0x1ba44020&quot;&gt;Yahoo&lt;/a&gt;!, and &lt;a href=&quot;http://dbpedia.org/resource/Microsoft&quot; id=&quot;link-id0x5788710&quot;&gt;Microsoft&lt;/a&gt; know what they are doing, with or without SQL, but the medium-size or fast-growing web sites seem to be in confusion when &lt;a href=&quot;http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29&quot; id=&quot;link-id0x18098f18&quot;&gt;LAMP&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/Ruby_programming_language&quot; id=&quot;link-id0x4844138&quot;&gt;Ruby&lt;/a&gt; or the scripting-du-jour can no longer cut it.&lt;/p&gt; &lt;p&gt;Can somebody using a database be expected to understand how it works? I would say no, not in general. Can a database be expected to unerringly self-configure based on workload? Sure, a database can suggest layouts, but it ought not restructure itself on the spur of the moment under full load.&lt;/p&gt; &lt;p&gt;It is safe to say that the community at large no longer believes in &amp;quot;one size fits all&amp;quot;. Since there is no general solution, there is a fragmented space of specific solutions. We will be looking at some of these issues in the following posts.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="dynamic_languages" />
  <atom:category term="ruby" />
  <atom:updated>2009-09-01T16:53:20-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Provenance and Reification in Virtuoso</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1572</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1572" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1572/1" rel="edit" />
  <atom:published>2009-09-01T14:44:08Z</atom:published>
  <atom:content type="html">&lt;p&gt;These days, &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x4a44870&quot;&gt;data&lt;/a&gt; provenance is a big topic across the board, ranging from the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x4e10e60&quot;&gt;linked data&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x4738350&quot;&gt;web&lt;/a&gt;, to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1fe33310&quot;&gt;RDF&lt;/a&gt; in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front.&lt;/p&gt; &lt;p&gt;RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.&lt;/p&gt; &lt;p&gt;For now, I will talk about possible ways of dealing with provenance annotations in &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x36581e8&quot;&gt;Virtuoso&lt;/a&gt; at a fairly technical level.&lt;/p&gt; &lt;p&gt;If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. 
For example, all such annotations could go into a special annotations graph.&lt;/p&gt; &lt;p&gt;On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfindexes&quot; id=&quot;link-id142a0798&quot;&gt;discussed in the Virtuoso documentation&lt;/a&gt;. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.&lt;/p&gt; &lt;p&gt;One could write queries like —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;&lt;pre&gt;SELECT ?pub WHERE { GRAPH ?g { ?person foaf:knows ?contact } ?contact foaf:name &amp;quot;Alice&amp;quot; . ?g xx:has_publisher ?pub }&lt;/pre&gt; &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;This would return the publishers of graphs that assert that somebody knows Alice.&lt;/p&gt; &lt;p&gt;Of course, the &lt;a href=&quot;http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification&quot; id=&quot;link-id14fa9488&quot;&gt;RDF reification vocabulary&lt;/a&gt; can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.&lt;/p&gt; &lt;p&gt;If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.&lt;/p&gt; &lt;p&gt;Virtuoso&amp;#39;s &lt;code&gt;RDF_QUAD&lt;/code&gt; table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. 
A &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x4b1d938&quot;&gt;SQL&lt;/a&gt; update statement can be used to set values for these additional columns if one knows the &lt;code&gt;G,S,P,O&lt;/code&gt;. &lt;/p&gt; &lt;p&gt;Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the &lt;code&gt;RDF_QUAD&lt;/code&gt; table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-&lt;code&gt;NULL&lt;/code&gt; and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that &lt;code&gt;G,S,P,O&lt;/code&gt; are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.&lt;/p&gt; &lt;p&gt;In &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x472afb0&quot;&gt;SPARQL&lt;/a&gt;, one could use an extension syntax like:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;&lt;pre&gt;SELECT * WHERE { ?person foaf:knows ?connection OPTION ( time ?ts ) . ?connection foaf:name &amp;quot;Alice&amp;quot; . FILTER ( ?ts &amp;gt; &amp;quot;2009-08-08&amp;quot;^^xsd:dateTime ) }&lt;/pre&gt; &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;This would return everybody recorded as knowing Alice by a quad inserted after 2009-08-08. This presupposes that the quad table has been extended with a datetime column.&lt;/p&gt; &lt;p&gt;The &lt;code&gt;OPTION (time ?ts)&lt;/code&gt; syntax is not presently supported but we can easily add something of the sort if there is user demand for it. 
In practice, this would be an extension mechanism enabling one to access extension columns of &lt;code&gt;RDF_QUAD&lt;/code&gt; via a column &lt;code&gt;?variable&lt;/code&gt; syntax in the &lt;code&gt;OPTION&lt;/code&gt; clause.&lt;/p&gt; &lt;p&gt;If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of &lt;code&gt;GSPO&lt;/code&gt; and a dependent part of &lt;code&gt;R&lt;/code&gt;, where &lt;code&gt;R&lt;/code&gt; would be the reification &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x365b190&quot;&gt;URI&lt;/a&gt; of the quad. Reification statements would then be made with &lt;code&gt;R&lt;/code&gt; as a subject. This would be more compact than the reification vocabulary and would not modify the &lt;code&gt;RDF_QUAD&lt;/code&gt; table. The syntax for referring to this could be something like:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;&lt;pre&gt;SELECT * WHERE { ?person foaf:knows ?contact OPTION ( reify ?r ) . ?r xx:assertion_time ?ts . ?contact foaf:name &amp;quot;Alice&amp;quot; . FILTER ( ?ts &amp;gt; &amp;quot;2008-08-08&amp;quot;^^xsd:dateTime ) }&lt;/pre&gt; &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don&amp;#39;t think there would be huge demand. Who knows? You tell us.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="webservices" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="web30" />
  <atom:category term="foaf" />
  <atom:category term="sparql" />
  <atom:category term="socialnetworking" />
  <atom:category term="virtuoso" />
  <atom:updated>2009-09-01T11:20:44-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>More On Parallel RDF/Text Query Evaluation</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1570</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1570" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1570/1" rel="edit" />
  <atom:published>2009-08-19T17:28:50Z</atom:published>
  <atom:content type="html">&lt;p&gt;We have received some more questions about &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x266cd288&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s parallel query evaluation model.&lt;/p&gt; &lt;p&gt;In answer, we will here explain how we do search engine style processing by writing &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x23c628b8&quot;&gt;SPARQL&lt;/a&gt;. There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce.&lt;/p&gt; &lt;p&gt;The point is that what used to require programming can often be done in a generic query language. The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit. But by combining these two things, we are a step closer to the web being the database.&lt;/p&gt; &lt;p&gt;I will here show how we do some joins combining full text, &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x22ff08b0&quot;&gt;RDF&lt;/a&gt; conditions, and aggregates and &lt;code&gt;ORDER BY&lt;/code&gt;. The sample task is finding the top 20 entities with New York in some attribute value. Then we specify the search further by only taking actors associated with New York. 
The results are returned in the order of a composite of &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x22da5258&quot;&gt;entity&lt;/a&gt; rank and text match score.&lt;/p&gt; &lt;p&gt;The basic query is:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;&lt;pre&gt; SELECT ( &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x237a6530&quot;&gt;sql&lt;/a&gt;:s_sum_page ( &amp;lt;sql:vector_agg&amp;gt; ( &amp;lt;bif:vector&amp;gt; ( ?c1 , ?sm ) ), bif:vector ( &amp;#39;new&amp;#39;, &amp;#39;york&amp;#39; ) ) ) AS ?res WHERE { { SELECT ( &amp;lt;SHORT_OR_LONG::&amp;gt;(?s1) ) AS ?c1 ( &amp;lt;sql:S_SUM&amp;gt; ( &amp;lt;SHORT_OR_LONG::IRI_RANK&amp;gt; ( ?s1 ) , &amp;lt;SHORT_OR_LONG::&amp;gt; ( ?s1textp ) , &amp;lt;SHORT_OR_LONG::&amp;gt; ( ?o1 ) , ?sc ) ) AS ?sm WHERE { ?s1 ?s1textp ?o1 . ?o1 bif:contains &amp;quot;new AND york&amp;quot; OPTION ( SCORE ?sc ) } ORDER BY DESC ( &amp;lt;sql:sum_rank&amp;gt; (( &amp;lt;sql:S_SUM&amp;gt; ( &amp;lt;SHORT_OR_LONG::IRI_RANK&amp;gt; ( ?s1 ) , &amp;lt;SHORT_OR_LONG::&amp;gt; ( ?s1textp ) , &amp;lt;SHORT_OR_LONG::&amp;gt; ( ?o1 ) , ?sc ) )) ) LIMIT 20 } } &lt;/pre&gt; &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;This takes some explaining. The basic part is:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;&lt;pre&gt;{ ?s1 ?s1textp ?o1 . ?o1 bif:contains &amp;quot;new AND york&amp;quot; OPTION ( SCORE ?sc ) }&lt;/pre&gt; &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;This just makes tuples where &lt;code&gt;?s1&lt;/code&gt; is the subject, &lt;code&gt;?s1textp&lt;/code&gt; the property, and &lt;code&gt;?o1&lt;/code&gt; the literal which contains &amp;quot;New York&amp;quot;. 
For a single &lt;code&gt;?s1&lt;/code&gt;, there can of course be many properties which all contain &amp;quot;New York&amp;quot;.&lt;/p&gt; &lt;p&gt;The rest of the query gathers all the &amp;quot;New York&amp;quot; containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities.&lt;/p&gt; &lt;p&gt;After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between &amp;quot;New York&amp;quot; and the strings containing &amp;quot;New York&amp;quot;. The text hit score is higher if the words repeat often and in close proximity.&lt;/p&gt; &lt;p&gt;The &lt;code&gt;s_sum&lt;/code&gt; function is a user-defined aggregate which takes 4 arguments: the rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score.&lt;/p&gt; &lt;p&gt;These are grouped by the subject of the triple. After this, these are sorted by &lt;code&gt;sum_rank&lt;/code&gt; of the aggregate constructed with &lt;code&gt;s_sum&lt;/code&gt;. The &lt;code&gt;sum_rank&lt;/code&gt; is a SQL function combining the entity rank with the text scores of the different literals.&lt;/p&gt; &lt;p&gt;This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple. The text index entries of an object are stored in the same partition as the object. But the entity rank is a property of the subject and is partitioned by the subject. Also the &lt;code&gt;GROUP BY&lt;/code&gt; is by the subject. Thus the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x24381030&quot;&gt;data&lt;/a&gt; is produced from all partitions, then streamed into the receiving partitions, determined by the subject. This partition can then get the score and group the matches by the subject. 
Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top &lt;code&gt;k&lt;/code&gt; sort can be done for each partition separately. Finally, the top 20 of each partition are merged into the global top 20. This is then passed to a final function &lt;code&gt;s_sum_page&lt;/code&gt; that turns this all into an &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x2363d6c0&quot;&gt;XML&lt;/a&gt; fragment that can be processed with XSLT for inclusion on a web page.&lt;/p&gt; &lt;p&gt;This differs from the text search engine in that the query pipeline can contain arbitrary cross-partition joins. Also, the string &amp;quot;New York&amp;quot; is a common label that occurs in many distinct entities. Thus a single text match, in this case to the one document containing only the string &amp;quot;New York&amp;quot;, will get many entities, likely all from different partitions.&lt;/p&gt; &lt;p&gt;So, if we only want actors with a mention of &amp;quot;New York&amp;quot;, we need to get the inner part of the query as:&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;&lt;pre&gt;{ ?s1 ?s1textp ?o1 . ?o1 bif:contains &amp;quot;new AND york&amp;quot; OPTION ( SCORE ?sc ) . ?s1 a &amp;lt;&lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x237110b8&quot;&gt;http&lt;/a&gt;://&lt;a href=&quot;http://umbel.org/about/&quot; id=&quot;link-id0x2318e198&quot;&gt;umbel&lt;/a&gt;.org/umbel/sc/Actor&amp;gt; }&lt;/pre&gt; &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;Whether an entity is an actor can be checked in the same partition as the rank of the entity. Thus the query plan gets this check right before getting the rank. 
This is natural since there is no point in getting the rank of something that is not an actor.&lt;/p&gt; &lt;p&gt;The &lt;code&gt;&amp;lt;short_or_long::sql:func&amp;gt;&lt;/code&gt; notation means that we call &lt;code&gt;func&lt;/code&gt;, which is a SQL stored procedure with the arguments in their internal form. Thus, if a variable bound to an IRI is passed, the &lt;code&gt;short_or_long&lt;/code&gt; specifies that it is passed as its internal ID and is not converted into its text form. This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end.&lt;/p&gt; &lt;p&gt;Now, when we run this on a collection of 4.5 billion triples of &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x24381160&quot;&gt;linked data&lt;/a&gt;, once we have the working set, we can get the top 20 &amp;quot;New York&amp;quot; occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy. (The hardware is two boxes with two quad-core Xeon 5345 each.)&lt;/p&gt; &lt;p&gt;If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy. This gets about 200K &amp;quot;New York&amp;quot; strings, which becomes about 400K entities with New York somewhere, for which a rank then has to be retrieved. After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups. These are then sorted. This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort.&lt;/p&gt; &lt;p&gt;By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions. 
There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously. In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous.&lt;/p&gt; &lt;p&gt;Doing &lt;code&gt;JOINs&lt;/code&gt; between partitions and partitioned &lt;code&gt;GROUP BY&lt;/code&gt;/&lt;code&gt;ORDER BY&lt;/code&gt; is pretty regular database stuff. Applying this to RDF is a most natural thing.&lt;/p&gt; &lt;p&gt;If we do not parallelize the user-defined aggregate for grouping all the &amp;quot;New York&amp;quot; occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we&amp;#39;d have to either bring all the data to a central point before the top k, which would destroy performance, or we would have to do procedures with explicit parallel procedure calls which is hard to write, surely too hard for &lt;i&gt;ad hoc&lt;/i&gt; queries.&lt;/p&gt; &lt;p&gt;&lt;a href=&quot;http://bit.ly/4jAVHC&quot; id=&quot;link-id114d58f0&quot;&gt;Results of live execution&lt;/a&gt; may not be complete on initial load, as this link includes a &amp;quot;Virtuoso Anytime&amp;quot; timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above.&lt;/p&gt; &lt;p&gt;Engineering matters. If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="rdf" />
  <atom:category term="xml" />
  <atom:category term="xslt" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="virtuoso" />
  <atom:updated>2009-08-19T14:00:29.000002-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Updated hardware improves LUBM 8000 load rate in Virtuoso 6</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1568</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1568" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1568/2" rel="edit" />
  <atom:published>2009-08-14T19:01:30Z</atom:published>
  <atom:content type="html">&lt;p&gt;We repeated the &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1562&quot; id=&quot;link-id173d3068&quot;&gt;earlier LUBM 8000 experiment&lt;/a&gt; on a newer machine, with 2 x Xeon 5520 and 72G 1333MHz memory, and once again with the 2 machines as a networked cluster. Otherwise the settings were the same.&lt;/p&gt; &lt;p&gt;The load rate is now 160,739 triples-per-second.&lt;/p&gt; &lt;table&gt; &lt;tr&gt; &lt;th&gt;&lt;/th&gt; &lt;td&gt;   &lt;/td&gt; &lt;th align=&quot;center&quot;&gt;&lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x199b9740&quot;&gt;Virtuoso&lt;/a&gt; 6 &lt;br /&gt; (previous run)&lt;/th&gt; &lt;td&gt;   &lt;/td&gt; &lt;th align=&quot;center&quot;&gt;Virtuoso 6 &lt;br /&gt; (new run)&lt;/th&gt; &lt;td&gt;   &lt;/td&gt; &lt;th align=&quot;center&quot;&gt;Virtuoso 6 &lt;br /&gt; (newest run)&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;blades&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;processors&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2 x Xeon 5410&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2 x Xeon 5520&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 2 x Xeon 5520 &lt;br /&gt;+ &lt;br /&gt;2 x Xeon 5410 &lt;br /&gt;with 1x1GigE &lt;br /&gt;interconnect &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;memory&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 16G 667 MHz&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;72G 1333 MHz&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt;72G 1333 MHz &lt;br /&gt;+ &lt;br /&gt; 
16G 667 MHz &lt;br /&gt; respectively&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;left&quot;&gt;reported load rate&lt;br /&gt;triples-per-second&lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 110,532 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 160,739 &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td align=&quot;center&quot;&gt; 214,188 &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;p&gt;Again, if others talk about loading LUBM, so must we. Otherwise, this metric is rather uninteresting.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="cluster" />
  <atom:category term="lubm" />
  <atom:category term="benchmarking" />
  <atom:category term="virtuoso" />
  <atom:category term="dataspace" />
  <atom:updated>2009-08-15T15:27:25-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Single Virtuoso host loads 110,500 triples-per-second on LUBM 8000</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1562</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1562" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1562/3" rel="edit" />
  <atom:published>2009-06-29T16:12:34Z</atom:published>
  <atom:content type="html">&lt;p&gt;LUBM load speed still seems to be a metric that is quoted in comparisons of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id142df6e8&quot;&gt;RDF&lt;/a&gt; stores. Consequently, we too measured the load time of LUBM 8000, 1,068 million triples, on the newest &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id1389dfa0&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16G 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes), one partition per CPU core. Each partition had its database striped over 6 disks total; the 6 disks on the system were shared between the 8 database processes.&lt;/p&gt; &lt;p&gt;The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. 100% counts here for one CPU core or one disk being constantly busy.&lt;/p&gt; &lt;p&gt;The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.&lt;/p&gt; &lt;p&gt;We think that LUBM loading is not a realistic benchmark for the world but since other people publish such numbers, so do we.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="cluster" />
  <atom:category term="lubm" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="virtuoso" />
  <atom:updated>2009-08-15T16:06:42.000001-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Comparing Virtuoso Performance on Different Processors</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1557</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1557" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1557/1" rel="edit" />
  <atom:published>2009-05-28T14:54:59Z</atom:published>
  <atom:content type="html">&lt;p&gt;Over the years we have run &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xd420b90&quot;&gt;Virtuoso&lt;/a&gt; on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.&lt;/p&gt; &lt;p&gt;Our test is very simple: &lt;i&gt;Load 20 warehouses of &lt;a href=&quot;http://dbpedia.org/resource/TPC-C&quot; id=&quot;link-id0xdaaec90&quot;&gt;TPC-C&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xca1b7e0&quot;&gt;data&lt;/a&gt;, and then run one client per warehouse for 10,000 new orders&lt;/i&gt;. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.&lt;/p&gt; &lt;p&gt;The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.&lt;/p&gt; &lt;p&gt;All times reported are real times, starting from the start of the first client and ending with the completion of the last client.&lt;/p&gt; &lt;p&gt;Do not confuse these results with official TPC-C. 
The measurement protocols are entirely incomparable.&lt;/p&gt; &lt;style type=&quot;text/css&quot;&gt; TABLE { background: none; border: none } TH { text-align: center; font-weight: bold } TR.top { background: } TD { text-align: center; border: none } &lt;/style&gt; &lt;table align=&quot;center&quot; cellspacing=&quot;10&quot;&gt; &lt;tr&gt; &lt;th&gt;Test&lt;/th&gt; &lt;th&gt;Platform&lt;/th&gt; &lt;th&gt;Load&lt;br /&gt;(seconds)&lt;/th&gt; &lt;th&gt;Run&lt;br /&gt;(seconds)&lt;/th&gt; &lt;th&gt;GHz / cores / threads&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;Amazon &lt;a href=&quot;http://aws.amazon.com/ec2/&quot; id=&quot;link-id0xdaab030&quot;&gt;EC2&lt;/a&gt; Extra Large&lt;br /&gt;(4 virtual cores)&lt;/td&gt; &lt;td&gt;340&lt;/td&gt; &lt;td&gt;42&lt;/td&gt; &lt;td&gt;1.2 GHz? / 4 / 1&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;Amazon EC2 Extra Large&lt;br /&gt;(4 virtual cores)&lt;/td&gt; &lt;td&gt;305&lt;/td&gt; &lt;td&gt;43.3&lt;/td&gt; &lt;td&gt;1.2 GHz? / 4 / 1&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;1 x dual-core AMD 5900&lt;/td&gt; &lt;td&gt;263&lt;/td&gt; &lt;td&gt;58.2&lt;/td&gt; &lt;td&gt;2.9 GHz / 2 / 1&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;2 x dual-core Xeon 5130 (&amp;quot;Woodcrest&amp;quot;)&lt;/td&gt; &lt;td&gt;245&lt;/td&gt; &lt;td&gt;35.7&lt;/td&gt; &lt;td&gt;2.0 GHz / 4 / 1&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;2 x quad-core Xeon 5410 (&amp;quot;Harpertown&amp;quot;)&lt;/td&gt; &lt;td&gt;237&lt;/td&gt; &lt;td&gt;18.0&lt;/td&gt; &lt;td&gt;2.33 GHz / 8 / 1&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;2 x quad-core Xeon 5520 (&amp;quot;Nehalem&amp;quot;)&lt;/td&gt; &lt;td&gt;162&lt;/td&gt; &lt;td&gt;18.3&lt;/td&gt; &lt;td&gt;2.26 GHz / 8 / 2&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;p&gt;We tried two different EC2 instances to see if there would be variation. The variation was quite small. 
The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8 GB. The 3 Xeon configurations are Supermicro boards with 667 MHz memory for the Xeon 5130 (&amp;quot;Woodcrest&amp;quot;) and Xeon 5410 (&amp;quot;Harpertown&amp;quot;), and 800 MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 (&amp;quot;Nehalem&amp;quot;), 72 GB RAM, and 8 x 500 GB SATA disks.&lt;/p&gt; &lt;p&gt; &lt;i&gt;Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., the memory subsystem. We will revisit this another time with faster memory.&lt;/i&gt; &lt;/p&gt; &lt;p&gt;The operating systems were various 64-bit Linux distributions.&lt;/p&gt; &lt;p&gt;We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.&lt;/p&gt; &lt;p&gt;We tried some &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xce85438&quot;&gt;RDF&lt;/a&gt; operations on the last two systems:&lt;/p&gt; &lt;table align=&quot;center&quot; cellspacing=&quot;10&quot;&gt; &lt;tr&gt; &lt;th&gt;Operation&lt;/th&gt; &lt;th&gt; Harpertown&lt;/th&gt; &lt;th&gt;Nehalem&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;Build text index for &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0xab826a8&quot;&gt;DBpedia&lt;/a&gt;&lt;/th&gt; &lt;td&gt;1080s&lt;/td&gt; &lt;td&gt;770s&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;&lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0xcbb9938&quot;&gt;Entity&lt;/a&gt; Rank iteration&lt;/th&gt; &lt;td&gt;263s&lt;/td&gt; &lt;td&gt;251s&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;p&gt;Then we tried to see if the core 
multithreading of Nehalem could be seen anywhere. To this end, we ran the Fibonacci function in &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xcd62218&quot;&gt;SQL&lt;/a&gt; to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.&lt;/p&gt; &lt;p&gt;For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.&lt;/p&gt; &lt;p&gt;Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is a matching row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes at about 650% CPU utilization. 
Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.&lt;/p&gt; &lt;ol&gt; &lt;li&gt;On the host OS of the Nehalem system — &lt;table align=&quot;center&quot; cellspacing=&quot;10&quot;&gt; &lt;tr&gt; &lt;th&gt;n&lt;/th&gt; &lt;th&gt;cpu%&lt;/th&gt; &lt;th&gt;rows per second&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;1 query&lt;/th&gt; &lt;td&gt;503&lt;/td&gt; &lt;td&gt;906,413&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;2 queries&lt;/th&gt; &lt;td&gt;1263&lt;/td&gt; &lt;td&gt;1,578,585&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;3 queries&lt;/th&gt; &lt;td&gt;1204&lt;/td&gt; &lt;td&gt;1,566,849&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/li&gt; &lt;li&gt;In a VM under Xen, on the Nehalem system — &lt;table align=&quot;center&quot; cellspacing=&quot;10&quot;&gt; &lt;tr&gt; &lt;th&gt;n&lt;/th&gt; &lt;th&gt;cpu%&lt;/th&gt; &lt;th&gt;rows per second&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;1 query&lt;/th&gt; &lt;td&gt;652&lt;/td&gt; &lt;td&gt;799,293&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;2 queries&lt;/th&gt; &lt;td&gt;1266&lt;/td&gt; &lt;td&gt;1,486,710&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;3 queries&lt;/th&gt; &lt;td&gt;1222&lt;/td&gt; &lt;td&gt;1,484,093&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/li&gt; &lt;li&gt; On the host OS of the Harpertown system — &lt;table align=&quot;center&quot; cellspacing=&quot;10&quot;&gt; &lt;tr&gt; &lt;th&gt;n&lt;/th&gt; &lt;th&gt;cpu%&lt;/th&gt; &lt;th&gt;rows per second&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;1 query&lt;/th&gt; &lt;td&gt; 648 &lt;/td&gt; &lt;td&gt; 1,041,448 &lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;2 queries&lt;/th&gt; &lt;td&gt; 708 &lt;/td&gt; &lt;td&gt; 1,124,866 &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/li&gt; &lt;/ol&gt; &lt;p&gt;The CPU percentages are as reported by the OS: user + system CPU divided by real time.&lt;/p&gt; &lt;p&gt;So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. 
The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.&lt;/p&gt; &lt;p&gt;We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, including tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.&lt;/p&gt; &lt;p&gt;The executables were compiled with &lt;code&gt;gcc&lt;/code&gt; with default settings. Specifying &lt;code&gt;-march=nocona&lt;/code&gt; (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform memory architecture. We would expect setting this affinity to increase performance since we have many equal size processes with even load.&lt;/p&gt; &lt;p&gt;The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got had 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory, else the best case will not be more than 50% better than a 667 MHz Core 2-based Xeon.&lt;/p&gt; &lt;p&gt;Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.&lt;/p&gt; &lt;p&gt;If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. 
The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="cluster" />
  <atom:category term="architecture" />
  <atom:category term="hpc" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="linux" />
  <atom:category term="virtuoso" />
  <atom:updated>2009-05-28T11:15:39-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Social Web Camp (#5 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1554</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1554" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1554/1" rel="edit" />
  <atom:published>2009-04-30T16:14:02Z</atom:published>
  <atom:content type="html">&lt;p&gt;(Last of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0xd28c860&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.) &lt;/p&gt; &lt;p&gt;The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.&lt;/p&gt; &lt;p&gt;In my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks — one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x10019a70&quot;&gt;information&lt;/a&gt; overload.&lt;/p&gt; &lt;p&gt;Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if social media consists only of automated fronts tweeting and re-tweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.&lt;/p&gt; &lt;p&gt;There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. 
To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don&amp;#39;t get lost in it.&lt;/p&gt; &lt;p&gt;There is &lt;a href=&quot;https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/index.html&quot; id=&quot;link-id170cb010&quot;&gt;a CIA memorandum about how analysts misinterpret data and see what they want to see&lt;/a&gt;. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.&lt;/p&gt; &lt;p&gt;I participated in discussions on security and privacy and on mobile social networks and context.&lt;/p&gt; &lt;p&gt;For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. 
But for purposes of future finding of dirt, it will stay findable for those who are looking.&lt;/p&gt; &lt;p&gt;There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies &lt;a id=&quot;link-id14aaff90&quot;&gt;à la&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x10b058d0&quot;&gt;SQL&lt;/a&gt; do not work well when schema is fluid and end-users can&amp;#39;t be expected to formulate or understand these. Remember &lt;a href=&quot;http://dbpedia.org/resource/Ted_Nelson&quot; id=&quot;link-id0x145b3070&quot;&gt;Ted Nelson&lt;/a&gt;? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic&amp;#39;s Data Patrol should be a part of the social web infrastructure of the future.&lt;/p&gt; &lt;p&gt;People at MIT have developed AIR (Accountability In &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x10dec8f8&quot;&gt;RDF&lt;/a&gt;) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. 
This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.&lt;/p&gt; &lt;p&gt;In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.&lt;/p&gt; &lt;p&gt;For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one&amp;#39;s location at the granularity of the city; for some other purposes, one would say which conference room one is in.&lt;/p&gt; &lt;p&gt;Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.&lt;/p&gt; &lt;p&gt;There is a thin line between convenience and having IT infrastructure rule one&amp;#39;s life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-&lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0xd7b1808&quot;&gt;knowledge&lt;/a&gt;, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. 
Thus automating sociality past a certain point seems a problematic proposition.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="infomania" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="howto" />
  <atom:category term="history" />
  <atom:category term="zigzag" />
  <atom:category term="openlink" />
  <atom:updated>2009-04-30T12:51:49-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Web Science and Keynotes at WWW 2009 (#4 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1551</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1551" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1551/1" rel="edit" />
  <atom:published>2009-04-30T16:00:22Z</atom:published>
  <atom:content type="html">&lt;p&gt;(Fourth of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0x1232b550&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.) &lt;/p&gt; &lt;p&gt;There was quite a bit of talk about what web science could or ought to be. I will here comment a bit on the &lt;a href=&quot;http://www2009.org/panels.html&quot; id=&quot;link-id1514ec30&quot;&gt;panels&lt;/a&gt; and &lt;a href=&quot;http://www2009.org/keynote_abs.html&quot; id=&quot;link-id11a5d620&quot;&gt;keynotes&lt;/a&gt;, in no special order. &lt;/p&gt; &lt;p&gt;In the web science panel, Tim Berners-Lee said that the deliverable of the web science initiative could be a way of making sense of all the world&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xe01cd68&quot;&gt;data&lt;/a&gt; once the web had transformed into a database capable of answering arbitrary queries.&lt;/p&gt; &lt;p&gt;Michael Brodie of Verizon said that one deliverable would be a well considered understanding of the issue of counter-terrorism and civil liberties: Everything, including terrorism, operates on the platform of the web. How do we understand an issue that is not one of privacy, intelligence, jurisprudence, or sociology, but of all these and more?&lt;/p&gt; &lt;p&gt;I would add to this that it is not only a matter of governments keeping and analyzing vast amounts of private data, but of basically anybody who wants to do this being able to do so, even if at a smaller scale. In a way, the data web brings formerly government-only capabilities to the public, and is thus a democratization of intelligence and analytics. The citizen blogger increased the accountability of the press; the citizen analyst may have a similar effect. This is trickier though. We remember Jefferson&amp;#39;s words about vigilance and the price of freedom. 
But vigilance is harder today, not because &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x130558b8&quot;&gt;information&lt;/a&gt; is not there but because there is so much of it, with diverse spins put on it.&lt;/p&gt; &lt;p&gt;Tim B-L said at another panel that it seemed as if the new capabilities, especially the web as a database, were coming just in time to help us cope with the problems confronting the planet. With this, plus having everybody online, we would have more information, more creativity, more of everything at our disposal.&lt;/p&gt; &lt;p&gt;I&amp;#39;d have to say that the web is dual use: The bulk of traffic may contribute to distraction more than to awareness, but then the same infrastructure and the social behaviors it supports may also create unprecedented value and in the best of cases also transparency. I have to think of &amp;quot;For whosoever hath, to him shall be given.&amp;quot; [Matthew 13:12] This can mean many things; here I am talking about whoever hath a drive for &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x16032470&quot;&gt;knowledge&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The web is both equalizing and polarizing: The equality is in the access; the polarity in the use made thereof. For a huge amount of noise there will be some crystallization of value that could not have arisen otherwise. Developments have unexpected effects. I would not have anticipated that gaming should advance supercomputing, for example.&lt;/p&gt; &lt;p&gt;Wendy Hall gave a dinner speech about communities and conferences; how the original hypertext conferences, with lots of representation of the humanities, became the techie WWW conference series; and how now we have the pendulum swinging back to more diversity with the web science conferences. So it is with life. 
Aside from the fact that there are trends and pendulum effects, and that paths that cross usually cross again, it is very hard to say exactly how these things play out.&lt;/p&gt; &lt;p&gt;At the &amp;quot;20 years of web&amp;quot; panel, there was a round of questions on how different people had been surprised by the web. Surprises ranged from the web&amp;#39;s actual scalability to its rapid adoption and the culture of &amp;quot;if I do my part, others will do theirs.&amp;quot; On the minus side, the emergence of spam and phishing was mentioned as an unexpected development.&lt;/p&gt; &lt;p&gt;Questions of simplicity and complexity got a lot of attention, along with network effects. When things hit the right simplicity at the right place (e.g., HTML and &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x1069cc18&quot;&gt;HTTP&lt;/a&gt;, which hypertext-wise were nothing special), there is a tipping point.&lt;/p&gt; &lt;p&gt;&amp;quot;No barrier to entry, not too much modeling&amp;quot; was repeated quite a bit, also in relation to &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x15d2c200&quot;&gt;semantic web&lt;/a&gt; and ontology design. There is a magic of emergent effects when the pieces are simple enough: Organic chemistry out of a couple of dozen elements; all the world&amp;#39;s information online with a few tags of markup and a couple of protocol verbs. But then this is where the real complexity starts — one half of it in the transport, the other in the applications, yet a narrow interface between the two.&lt;/p&gt; &lt;p&gt;This then raises the question of content- and application-aware networks. The preponderance of opinion was for separation of powers — keep carriers and content apart.&lt;/p&gt; &lt;p&gt;Michael Brodie commented in the questions to the first panel that simplicity was greatly overrated, that the world was in fact very complex. 
It seems to me that any field of human endeavor develops enough complexity to fully occupy the cleverest minds who undertake said activity. The life-cycle between simplicity and complexity seems to be a universal feature. It is a bit like the Zen idea that &amp;quot;for the beginner, rivers are rivers and mountains are mountains, for the student these are imponderable mysteries of bewildering complexity and transcendent dimension but for the master these are again rivers and mountains.&amp;quot; One way of seeing this is that the master, in spite of the actual complexity and interrelatedness of all things, sees where these complexities are significant and where not and knows to communicate concerning these as fits the situation.&lt;/p&gt; &lt;p&gt;There is no fixed formula for saying where complexities and simplicities fit; relevance of detail is forever contextual. For technological systems, we find that there emerge relatively simple interfaces on either side of which there is huge complexity: The x86 instruction set, TCP/IP, &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x10363000&quot;&gt;SQL&lt;/a&gt;, to name a few. These are lucky breaks; it is very hard to say beforehand where these will emerge. Object oriented people would like to see such interfaces everywhere, which just leads to problems of modeling.&lt;/p&gt; &lt;p&gt;There was a keynote from Telefonica about infrastructure. We heard that the power and cooling cost more than the equipment, that data centers ought to be scaled down from the football stadium and 20 megawatt scale, that systems must be designed for partitioning, to name a few topics. This is all well accepted. The new question is whether storage should go into the network infrastructure. We have blogged that the network will be the database, and it is no surprise that a telco should have the same idea, just with slightly different emphasis and wording. 
For Telefonica, this is about efficiency of bulk delivery; for us, this is more about virtualized query-able dataspaces. Both will be distributed but issues of separation of powers may keep the two roles of network with storage separate.&lt;/p&gt; &lt;p&gt;In conclusion, the network being the database was much more visible and accepted this year than last. The &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x100f4cf0&quot;&gt;linked data&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x15a55db8&quot;&gt;web&lt;/a&gt; was in Tim B-L&amp;#39;s keynote as it was in the opening speech by the Prince of Asturias.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="semanticweb" />
  <atom:category term="web30" />
  <atom:category term="socialnetworking" />
  <atom:category term="visionary" />
  <atom:updated>2009-04-30T12:11:44-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Short Recap of Virtuoso Basics (#3 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1550</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1550" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1550/1" rel="edit" />
  <atom:published>2009-04-30T15:49:53Z</atom:published>
  <atom:content type="html">&lt;p&gt;(Third of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0x1081fe40&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.) &lt;/p&gt; &lt;p&gt;There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xd0e85f0&quot;&gt;Virtuoso&lt;/a&gt; is a DBMS. We pitch it primarily to the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x14a294d8&quot;&gt;data&lt;/a&gt; web space because this is where we see the emerging frontier. Virtuoso does both &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x108042f8&quot;&gt;SQL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x10889878&quot;&gt;SPARQL&lt;/a&gt; and can do both at large scale and high performance. The popular perception of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x107d3b40&quot;&gt;RDF&lt;/a&gt; and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals, as native SQL types, and to have a cost-based optimizer that knows about all of this.&lt;/p&gt; &lt;p&gt;If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. 
&lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews&quot; id=&quot;link-id14ddc7c8&quot;&gt;Rendering application specific data structures as RDF&lt;/a&gt; applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/qsvdbsrv.html&quot; id=&quot;link-id14aaea70&quot;&gt;federate tables from heterogeneous DBMS&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;On top of this, there is a &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/qswebserver.html&quot; id=&quot;link-id16fcde60&quot;&gt;web server built in&lt;/a&gt;, so that no extra server is needed for web services, web pages, and the like.&lt;/p&gt; &lt;p&gt;Installation is simple: just one executable and one config file. There is a huge amount of code in &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/installation.html&quot; id=&quot;link-id16767b40&quot;&gt;installers&lt;/a&gt; — application code and test suites and such — but none of this is needed when you deploy. 
Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.&lt;/p&gt; &lt;p&gt;Clusters (coming in Release 6) and SQL federation are &lt;a href=&quot;http://download.openlinksw.com/download/product_matrix.vsp?p=l_os&amp;amp;c=39&amp;amp;df=16&quot; id=&quot;link-id16722550&quot;&gt;commercial only&lt;/a&gt;; the rest can be had &lt;a href=&quot;http://sourceforge.net/project/showfiles.php?group_id=161622&quot; id=&quot;link-id131080a8&quot;&gt;under GPL&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;To condense further:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Scalable Delivery of &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x12211da8&quot;&gt;Linked Data&lt;/a&gt; &lt;/li&gt; &lt;li&gt;SPARQL and SQL &lt;ul&gt; &lt;li&gt;Arbitrary RDF Data + Relational&lt;/li&gt; &lt;li&gt;Also From 3rd Party &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x168db0e0&quot;&gt;RDBMS&lt;/a&gt; &lt;/li&gt; &lt;/ul&gt; &lt;/li&gt; &lt;li&gt;Easy Deployment &lt;/li&gt; &lt;li&gt;Standard Interfaces &lt;ul&gt; &lt;li&gt; &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x10473bf0&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x12187f58&quot;&gt;JDBC&lt;/a&gt;, OLE DB, &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x10354e48&quot;&gt;ADO&lt;/a&gt;.&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x16eeadd0&quot;&gt;NET&lt;/a&gt;, XMLA&lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x12e3fe08&quot;&gt;Jena&lt;/a&gt;, &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x15e62470&quot;&gt;Sesame&lt;/a&gt;, etc.&lt;/li&gt; &lt;li&gt;All Web Protocols &lt;/li&gt; &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="webservices" />
  <atom:category term="rdf" />
  <atom:category term="oledb" />
  <atom:category term="sql" />
  <atom:category term="jdbc" />
  <atom:category term="odbc" />
  <atom:category term="semanticweb" />
  <atom:category term="web30" />
  <atom:category term="sparql" />
  <atom:category term="virtuoso" />
  <atom:category term=".net" />
  <atom:updated>2009-04-30T12:11:43.000001-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Search at WWW 2009 (#2 of 5)</atom:title>
  <atom:id>http://www.openlinksw.com/weblog/oerling/?id=1548</atom:id>
  <atom:link href="http://www.openlinksw.com/weblog/oerling/?id=1548" type="text/html" rel="alternate" />
  <atom:link href="http://www.openlinksw.com/GData/oerling-blog-0/1548/2" rel="edit" />
  <atom:published>2009-04-30T15:18:24Z</atom:published>
  <atom:content type="html">&lt;p&gt;(Second of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id124024c8&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.) &lt;/p&gt; &lt;p&gt;There was a &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/paper/109/html&quot; id=&quot;link-id1207a3b0&quot;&gt;workshop on semantic search&lt;/a&gt; plus &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/html&quot; id=&quot;link-id1704ff48&quot;&gt;a number of papers&lt;/a&gt; and of course &lt;a href=&quot;http://www2009.org/keynote.html&quot; id=&quot;link-id11ec08d8&quot;&gt;keynotes from Google and Yahoo&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Use of query logs and click through of search results for improved ranking was mentioned many times throughout the conference.&lt;/p&gt; &lt;p&gt;The &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/paper/109/html&quot; id=&quot;link-id120b7d38&quot;&gt;semantic search workshop&lt;/a&gt; was largely about benchmarks for keyword search in &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id171e2950&quot;&gt;information&lt;/a&gt; retrieval. For &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id11e1a9b0&quot;&gt;linked data&lt;/a&gt;, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. 
But there is a divide in approach.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://g1o.net/foaf.rdf#me&quot; id=&quot;link-id11d1c7b0&quot;&gt;Giovanni Tummarello&lt;/a&gt; &lt;a href=&quot;http://data.semanticweb.org/conference/www/2009/paper/59/html&quot; id=&quot;link-id169add28&quot;&gt;presented&lt;/a&gt; &lt;a href=&quot;http://sig.ma/&quot; id=&quot;link-id11af0128&quot;&gt;Sig.ma&lt;/a&gt;, a service using &lt;a href=&quot;http://sindice.com/&quot; id=&quot;link-id11a69fa0&quot;&gt;Sindice&lt;/a&gt;&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id11f3a088&quot;&gt;RDF&lt;/a&gt; index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Further speculating, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources in our browser too. Fresnel lenses are another thing to look at.&lt;/p&gt; &lt;p&gt;There was &lt;a href=&quot;http://www2009.eprints.org/242/&quot; id=&quot;link-id11dc7c68&quot;&gt;a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research&lt;/a&gt;, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community-based ranking model but raises privacy issues. 
Another point was that with local processing and personal-scale &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id171a8948&quot;&gt;data&lt;/a&gt; volumes, various kinds of brute-force processing were feasible that would cost a lot at web scale. Much can be done at web scale, but it must be done cleverly, not with a shell script and not so ad hoc.&lt;/p&gt; &lt;p&gt;As a counterpoint to this, there was &lt;a href=&quot;http://www2009.eprints.org/220/&quot; id=&quot;link-id120bf9e0&quot;&gt;a talk about Hadoop and Hive&lt;/a&gt;, a map-reduce-based &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-idee5d700&quot;&gt;SQL&lt;/a&gt;-like framework. One could do an SQL &lt;code&gt;GROUP BY&lt;/code&gt; on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is that if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing become important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their &lt;a href=&quot;http://database.cs.brown.edu/sigmod09/&quot; id=&quot;link-id11be1088&quot;&gt;critique of map reduce&lt;/a&gt;. 
It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.&lt;/p&gt; &lt;p&gt;Some of our future plans were confirmed by what we saw, for example concerning:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Interactively selecting sources for search, showing the graphs, then interactively refining&lt;/li&gt; &lt;li&gt;More social networks, more network analysis, and more work on social recommendation&lt;/li&gt; &lt;li&gt;Real-time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting microformats from results&lt;/li&gt; &lt;li&gt;Using &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id16440770&quot;&gt;entity&lt;/a&gt; extraction&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others; details will be worked out while we work on the items we implement ourselves.&lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Orri Erling</atom:name>
    <atom:email>oerling@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="cluster" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term=".net" />
  <atom:updated>2009-04-30T12:51:48-04:00</atom:updated>
 </atom:entry>
</atom:feed>