Orri Erling

European Commission and the Data Overflow

The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.

Since the questionnaire is public, I am publishing my answers below.

  1. Data and data types

    1. What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?

      Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.

      This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.

      Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.

      The volume of the well-known linked data sets is around 10 billion statements. With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.

      Relevant sections of this mass of data are a potential addition to any present or future analytics application.

      Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.

      Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.

      By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.

    2. What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?

      All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.

      Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
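      As a rough illustration of such a combined query, here is a minimal Python sketch. The `Photo` record, the `find_sunsets` function, and the `looks_like_sunset` stand-in are hypothetical names invented for this example, as are the dates; a real system would run the image test inside the DBMS, next to the stored blob. The sketch only shows the ordering of conditions: the cheap social, spatial, temporal, and numeric filters run first, so the expensive blob test touches as few images as possible.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Set


@dataclass
class Photo:
    owner: str
    taken: date
    place: str
    megapixels: float
    blob: bytes  # image content; stays where it is stored


def find_sunsets(photos: List[Photo], friends: Set[str],
                 start: date, end: date,
                 is_sunset: Callable[[bytes], bool]) -> List[Photo]:
    """Combine social, spatial, temporal, numeric, and blob conditions."""
    return [
        p for p in photos
        if p.owner in friends           # social graph condition
        and p.place == "Ibiza"          # geospatial condition (simplified to a label)
        and start <= p.taken <= end     # temporal condition
        and p.megapixels > 20           # plain numeric condition
        and is_sunset(p.blob)           # expensive test, evaluated last
    ]


calls = []  # records how often the expensive test actually runs


def looks_like_sunset(blob: bytes) -> bool:
    """Stand-in for a real image classifier (assumption for this sketch)."""
    calls.append(1)
    return blob.startswith(b"SUNSET")


photos = [
    Photo("ann", date(2009, 7, 14), "Ibiza", 21.0, b"SUNSET-1"),
    Photo("ann", date(2009, 7, 15), "Ibiza", 12.0, b"SUNSET-2"),  # too few megapixels
    Photo("zed", date(2009, 7, 14), "Ibiza", 24.0, b"SUNSET-3"),  # not a friend
    Photo("bob", date(2009, 7, 20), "Ibiza", 22.0, b"BEACH-1"),   # not a sunset
]

hits = find_sunsets(photos, {"ann", "bob"},
                    date(2009, 6, 1), date(2009, 8, 31), looks_like_sunset)
```

      Because Python's `and` short-circuits, the image test runs only for the photos that pass every cheaper condition.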

      Interleaving of all database functions and types becomes increasingly important.

  2. Industries, communities

    1. Who is producing these data and why? Could they do it better? How?

      Right now, projects such as Bio2RDF, Neurocommons, and DBpedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).

      Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
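      To make "giving everything a URI" concrete, here is a minimal Python sketch of turning one relational row into RDF-style triples. It is not the W3C working group's mapping language; the base URI, the `row_to_triples` helper, and the sample customer row are all assumptions made up for illustration.

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.com/crm/"  # hypothetical base URI


def row_to_triples(table: str, key: str, row: Dict[str, object]) -> Iterator[Triple]:
    """Yield one (subject, predicate, object) triple per non-null column."""
    subject = f"{BASE}{table}/{row[key]}"  # the row's primary key becomes its URI
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is already in the subject; NULLs yield no statement
        yield (subject, f"{BASE}schema/{table}#{column}", str(value))


customer = {"id": 42, "name": "Acme Oy", "email": "info@acme.example", "fax": None}
triples = list(row_to_triples("customer", "id", customer))
```

      Once emails and support tickets are given URIs in the same way, a statement in one system can refer to the customer simply by reusing the customer's URI.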

    2. Who is consuming these data and why? Could they do it better? How?

      Consumers are various. The greatest need is for tools that summarize complex data and give a bird's eye view of what data is available in the first place. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.

      Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.

    3. What industrial sectors in Europe could become more competitive if they became much better at managing data?

      Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.

    4. Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?

      The regulation landscape drives database demand through data retention requirements and the like.

      With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.

      For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
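      A minimal Python sketch of what a policy-based, on-the-fly-anonymizing view over triples might look like follows. The roles, the predicate classification, the key, and the `view_for` function are assumptions invented for this example, not an existing system's API. Note also that keyed hashing of identifiers is only pseudonymization; real anonymization of medical data needs considerably more care.

```python
import hashlib
import hmac
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

SECRET = b"demo-only-key"   # hypothetical key; would be kept server-side
IDENTIFYING = {"name"}      # predicates that directly identify a person
SENSITIVE = {"diagnosis"}   # predicates restricted by policy


def pseudonym(uri: str) -> str:
    """Replace a subject URI with a stable keyed pseudonym."""
    digest = hmac.new(SECRET, uri.encode("utf-8"), hashlib.sha256).hexdigest()
    return "urn:anon:" + digest[:16]


def view_for(roles: Set[str], triples: Iterable[Triple]) -> List[Triple]:
    """Return only the statements that the given roles may see."""
    if "clinician" in roles:
        return list(triples)  # full access
    if "researcher" in roles:
        # Anonymizing view: drop identifying statements, pseudonymize subjects.
        return [(pseudonym(s), p, o) for s, p, o in triples if p not in IDENTIFYING]
    # Everyone else: sensitive statements are silently left out.
    return [t for t in triples if t[1] not in SENSITIVE]


DATA = [
    ("patient/1", "name", "Alice"),
    ("patient/1", "diagnosis", "asthma"),
    ("patient/1", "city", "Turku"),
]

clinician_view = view_for({"clinician"}, DATA)
researcher_view = view_for({"researcher"}, DATA)
public_view = view_for(set(), DATA)
```

      The policy is attached to predicates rather than to tables or columns, which is what makes it applicable when there is no fixed schema.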

      More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.

    5. What are the main practical problems identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.

      We have come across the following:

      • Knowing that the data exists in the first place.
      • If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
      • Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data.
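      The kind of one-off alignment code meant in the last point can be sketched in a few lines of Python. The example below assumes linear interpolation is acceptable for the measured quantity, which is a choice made here for illustration and not something that holds for every instrument; the function names and sample values are likewise hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Series = List[Tuple[float, float]]  # (timestamp, value), sorted by timestamp


def interpolate(series: Series, t: float) -> Optional[float]:
    """Linearly interpolate the series at time t; None outside its range."""
    times = [ts for ts, _ in series]
    if not series or t < times[0] or t > times[-1]:
        return None  # do not extrapolate
    i = bisect_left(times, t)
    if times[i] == t:
        return series[i][1]  # exact sample available
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(a: Series, b: Series) -> List[Tuple[float, float, float]]:
    """Resample b onto the timestamps of a, keeping only overlapping times."""
    rows = []
    for t, va in a:
        vb = interpolate(b, t)
        if vb is not None:
            rows.append((t, va, vb))
    return rows


hourly = [(0, 10.0), (60, 12.0), (120, 11.0)]    # regular interval, in minutes
irregular = [(15, 1.0), (45, 3.0), (105, 6.0)]   # different, uneven interval
aligned = align(hourly, irregular)
```

      Only the timestamp 60 falls inside the range covered by both series, so `aligned` holds a single row. Every pair of sources needs a decision of this kind (interpolate, aggregate, or discard), which is why the alignment work rarely generalizes.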

      Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.

  3. Services, software stacks, protocols, standards, benchmarks

    1. What combinations of components are needed to deal with these problems?

      Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are examples of DBMSs featuring work in this direction.

      Interoperability and at least de facto standards in ways of doing this will emerge.

    2. What data exchange and processing mechanisms will be needed to work across platforms and programming languages?

      HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.

      There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.

      For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.

    3. What data environments are today so wastefully messy that they would benefit from the development of standards?

      RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.

      Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.

      Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS* standards, for example.

    4. What kind of performance is expected or required of these systems? Who will measure it reliably? How?

      Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.

      The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.

      These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.

      We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.

      The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.

      Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. But the TPC-H rules do not allow official reporting of such.

      Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.

      If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.

      Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.

      Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.

  4. Usability and training

    1. How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?

      In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.

      Beyond these interfaces, things get hard: programming an arbitrary algorithm with MPI so that it makes good use of a cluster platform, for example, is quite difficult. The casual amateur is hereby warned.

      There is no single solution. Explicit, programmatic parallelization, with MPI for example, scales very poorly in terms of required skill, so for automatic parallelization we should favor declarative and/or functional approaches.

      Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.

      For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.

      For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.

      This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.

      One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more.

      This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.

    2. How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?

      For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.

      RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.

      A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.

      For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.

      The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.

      As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.

      If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.

      The problem is creating the pull.

  5. Challenges

    1. What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?

      The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.

      Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.

      The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.

      The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.

      If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.

    2. What should one do to set up such a challenge, administer, and monitor it?

      The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.

      The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.

      There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.

      Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.

10/27/2009 13:29 GMT Modified: 10/27/2009 14:57 GMT
VLDB 2009 Web Scale Data Management Panel (5 of 5)

"The universe of cycles is not exactly one of literal cycles, but rather one of spirals," mused Joe Hellerstein of UC Berkeley.

"Come on, let's all drop some ACID," interjected another.

"It is not that we end up repeating the exact same things, rather even if some patterns seem to repeat, they do so at a higher level, enhanced by the experience gained," continued Joe.

Thus did the Web Scale Data Management panel conclude.

Whether successive generations are made wiser by the ones that have gone before may be argued either way.

The cycle in question was that of developers discovering ACID in the 1960s, i.e., Atomicity, Consistency, Isolation, Durability. Thus did the DBMS come into being. Then DBMSs kept becoming more complex until, as there will be a counter-force to each force, came the meme of key value stores and BASE: no multiple-row transactions, eventual consistency, no query language, but scaling to thousands of computers. So now, the DBMS community asks itself what went wrong.

In the words of one panelist, another demonstrated a "shocking familiarity with the subject matter of substance abuse" when he called for the DBMS community to get on a 12 step program and to look where addiction to certain ideas, ACID among them, had brought its life. Look at yourself: The influential papers in what ought to be your space by rights are coming from the OS community: Google Bigtable, Amazon Dynamo, want more? When you ought to drive, you give excuses and play catch up! Stop denial, drop SQL, drop ACID!

The web developers have revolted against the time-honored principles of the DBMS. This is true. Sharded MySQL is not the ticket — or is it? Must they rediscover the virtues of ACID, just like the previous generation did?

Nothing under the sun is new. As in music and fashion, trends keep cycling also in science and engineering.

But seriously, does the full-featured DBMS scale to web scale? Microsoft says the Azure version of SQL Server does. Yahoo says they want no SQL but Hadoop and PNUTS.

Twitter, Facebook, and other web names got their own discussion. Why do they not go to serious DBMS vendors for their data but make their own, like Facebook with Hive?

Who can divine the mind of the web developer? What makes them go to memcached, manually sharded MySQL, and MapReduce, walking away from the 40 years of technology invested in declarative query and ACID? What is this highly visible but hard to grasp entity? My guess is that they want something they can understand, at least at the beginning. A DBMS, especially on a cluster, is complicated, and it is not so easy to say how it works and how its performance is determined. The big brands, if deployed on a thousand PCs, would also be prohibitively expensive. But if all you do with the DBMS is single row selects and updates, it is no longer so scary, but you end up doing all the distributed things in a middle layer, and abandoning expressive queries, transactions, and database-supported transparency of location. But at least now you know how it works and what it is good/not good for.

This would be the case for those who make a conscious choice. But by and large the choice is not deliberate; it is something one drifts into: The application gains popularity; the single LAMP can no longer keep all in memory; you need a second MySQL in the LAMP and you decide that users A–M go left and N–Z right (horizontal partitioning). This siren of sharding beckons you and all is good until you hit the reef of re-architecting. Memcached and duct-tape help, like aspirin helps with hangover, but the root cause of the headache lies unaddressed.
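To see where the reef lies, consider a minimal Python sketch of the A–M / N–Z routing rule and of what happens when a third MySQL is added. The function names, the three-way split points, and the user names are made up for illustration; the point is only that changing the rule moves existing users between shards.

```python
from typing import List


def shard_two_way(username: str) -> int:
    """Original rule: users A-M go left (0), everyone else right (1)."""
    first = username[:1].upper()
    return 0 if "A" <= first <= "M" else 1  # non-letters and empty names land on 1


def shard_three_way(username: str) -> int:
    """Hypothetical later rule after adding a third MySQL: A-H, I-Q, the rest."""
    first = username[:1].upper()
    if "A" <= first <= "H":
        return 0
    if "I" <= first <= "Q":
        return 1
    return 2


def users_to_move(usernames: List[str]) -> List[str]:
    """Users whose data must be migrated when the rule changes."""
    return [u for u in usernames if shard_two_way(u) != shard_three_way(u)]


users = ["alice", "bob", "ivan", "nina", "zoe"]
moved = users_to_move(users)
```

Here two of five users change shards, and every query, join, and transaction that spans shards has to be handled in application code both before and after the migration.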

The conclusion was that there ought to be something incrementally scalable from the get-go. Low cost of entry and built-in scale-out. No, the web developers do not hate SQL; they just have gotten the idea that it does not scale. But they would really wish it to. So, DBMS people, show there is life in you yet.

Joe Hellerstein was the philosopher and paradigmatician of the panel. His team had developed a protocol-compatible Hadoop in a few months using a declarative logic programming style approach. His claim was that developers made the market. Thus, for writing applications against web scale data, there would have to be data centric languages. Why not? These are discussed in Berkeley Orders Of Magnitude (BOOM).

I come from Lisp myself, way back. I have since abandoned any desire to tell anybody what they ought to program in. This is a bit like religion: Attempting to impose or legislate or ram it on somebody just results in anything from lip service to rejection to war. The appeal exerted by the diverse language/paradigm -isms on their followers seems to be based on hitting a simplification of reality that coincides with a problem in the air. MapReduce is an example of this. PHP is another. A quick fix for a present need: Scripting web servers (PHP) or processing tons of files (MapReduce). The full database is not as quick a fix, even though it has many desirable features. It is also not as easy to tell what happens inside one, so MapReduce may give a greater feeling of control.

Totally self-managing, dynamically-scalable RDF would be a fix for not having to design or administer databases: Since it would be indexed on everything, complex queries would be possible; no full database scans would stop everything. For the mid-size segment of web sites this might be a fit. For the extreme ends of the spectrum, the choice is likely something custom built and much less expressive.

The BOOM rule language for data-centric programming would be something very easy for us to implement; in fact, we will get something of the sort essentially for free when we do the rule support already planned.

The question is, can one induce web developers to do logic? The history is one of procedures, both in LAMP and MapReduce. On the other hand, the query languages that were ever universally adopted were declarative, i.e., keyword search and SQL. There certainly is a quest for an application model for the cloud space beyond just migrating apps. We'll see. More on this another time.

09/01/2009 12:24 GMT Modified: 09/02/2009 12:05 GMT
Social Web Camp (#5 of 5)

(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)

The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.

By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks — one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload.

Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts tweeting and retweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.

There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don't get lost in it.

There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.

I participated in discussions on security and privacy and on mobile social networks and context.

For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.

There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid, and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlik's DataPatrol should be a part of the social web infrastructure of the future.

People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we look at the history of secrets at all, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.

In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.

For mobility, there was some talk about the context of usage. Again, this is difficult. In some contexts, one would disclose one's location only at the granularity of the city; in others, one would say which conference room one is in.

Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.

There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.

04/30/2009 12:14 GMT Modified: 04/30/2009 12:51 GMT
Virtuoso - Are We Too Clever for Our Own Good? (updated)

"Physician, heal thyself," it is said. We profess to say what the messaging of the semantic web ought to be, but is our own perfect?

I will here engage in some critical introspection as well as amplify on some answers given to Virtuoso-related questions in recent times.

I use some conversations from the Vienna Linked Data Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the ODS (OpenLink Data Spaces) applications line, OAT (OpenLink Ajax Toolkit), or ODE (OpenLink Data Explorer).

"It is not always clear what the main thrust is, we get the impression that you are spread too thin," said Sören Auer.

Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.

This is why we put a lot of emphasis on Linked Data and the Data Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive RDF store.

We can do this because we own our database and SQL and data access middleware and have a history of connecting to any RDBMS out there.

The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.

There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are SPARQL options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced Sesame- and Jena-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).

Now, this message could be better reflected in our material on the web. This blog is a rather informal step in this direction; more is to come. For now we concentrate on delivering.

The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.

This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story — for example, turning the whole world into a database. This is why our examples start with things like turning the TPC-H database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?

The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more about figuring out what is out there and adapting these and making a sort of compatibility list. Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?

"You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid."

We should answer in multiple parts.

For general collateral, like web sites and documentation:

The web site gives a confused product image. For the Virtuoso product, we should divide at the top into

  • Data web and RDF - Host linked data, expose relational assets as linked data;
  • Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;
  • Web Services - access all the above over standard protocols, dynamic web pages, web hosting.

For each point, one simple statement. We all know what the above things mean?

Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means that much more data or, in some cases, that many more requests per second. This too is clear.

As far as I am concerned, hosting Java or .NET does not have to be on the front page. Also, we have no great interest in going against Apache when it comes to a web-server-only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.

Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.

Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.

I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.

"Virtuoso is very big and very difficult"

One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.

This gives you SQL and SPARQL out of the box. Adding ODBC and JDBC clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.

Now for the difficult side:

Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to — PHP or Java or anything else. For web services, simple is best. Stick to basics. "The engineer is one who can invent a simple thing." Use SQL statements rather than admin UI.

Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, as produced by the installers, is cluttered.

We should put this into a couple of use case oriented how-tos.

Also, we should create a network of "friendly local virtuoso geeks" for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there — please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.

"OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?"

Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.

If one really wants to do one's server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.

We are talking about such things with different parties at present.

"How webby are you? What is webby?"

"Webby means distributed, heterogeneous, open; not monolithic consolidation of everything."

We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice — the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant when it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.

Google-style crawling of everything becomes less practical if one needs to run complex ad hoc queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.

"How does OpenLink see the next five years unfolding?"

Personally, I think we have the basics for the birth of a new inflection in the knowledge economy. The URI is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more information can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.

The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is context. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.

Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope, depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.

Ever since the .com era, the URL has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.

With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.

For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole Linked Open Data cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.

This makes some things possible that were hard thus far.

On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.

Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso Sponger, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.

Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.

Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event. Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.

10/26/2008 12:15 GMT Modified: 10/27/2008 12:07 GMT
Transitivity and Graphs for SQL

Background

I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language.

The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead.

It is now time to apply this principle to graph traversal.

The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.

The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step.
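
As a minimal sketch of what such a pre-query materialization step amounts to (the facts and the function name are hypothetical, and this is plain Python rather than anything a particular RDF store does internally), the implied part-of facts can be computed by applying the transitivity rule until nothing new appears:

```python
# Hypothetical part-of facts as (part, whole) pairs; not data from any real store.
part_of = {("a", "b"), ("b", "c")}

def materialize(facts):
    """Forward-chain the transitivity rule until no new fact is implied."""
    closure = set(facts)
    while True:
        implied = {(x, z)
                   for (x, y1) in closure
                   for (y2, z) in closure
                   if y1 == y2} - closure
        if not implied:
            return closure
        closure |= implied

stored = materialize(part_of)
# The facts that would be inserted explicitly before querying.
print(sorted(stored - part_of))
```

Every stored pair can then be answered by a plain lookup, at the price of recomputing the closure whenever the underlying facts change.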

This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc.

Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be included in or excluded from the set being analyzed from one query to the next. This is why with Virtuoso we have tended to favor inference on demand ("backward chaining") and mapping of relational data into RDF without copying.

The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows definition of recursive views. The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node's immediate children.
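
For illustration, here is a sketch of such a recursive view, run against SQLite from Python's standard library as a stand-in engine (not Virtuoso). The table "part", its columns, and the sample rows are assumptions made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (id INTEGER PRIMARY KEY, parent INTEGER);
    INSERT INTO part VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")

# Tree walk: the starting node, UNION ALL the recursive walk of its children.
rows = conn.execute("""
    WITH RECURSIVE walk(id) AS (
        SELECT id FROM part WHERE id = ?
        UNION ALL
        SELECT p.id FROM part p JOIN walk w ON p.parent = w.id
    )
    SELECT id FROM walk
""", (1,)).fetchall()

subtree = sorted(r[0] for r in rows)
print(subtree)
```

Note that the direction of the walk, from root to leaf, is fixed by how the view is written.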

The main problem with this is that I do not see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL's non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.

Take a question like "list the parts of products of category C which have materials that are classified as toxic." Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where "toxic" has a multilevel substructure.

Depending on the count of products and materials, the query can be evaluated in either of two ways. One can go from products to parts to materials and then climb up the materials tree to see if the material is toxic. Or one can do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; that is, regular cost-based optimization.

Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.

In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.

SQL and Transitivity

We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., SELECT in another SELECT's FROM clause, with a TRANSITIVE clause.

Consider the data:

CREATE TABLE "knows" 
   ("p1" INT, 
    "p2" INT, 
    PRIMARY KEY ("p1", "p2")
   );
ALTER INDEX "knows" 
   ON "knows" 
   PARTITION ("p1" INT);
CREATE INDEX "knows2" 
   ON "knows" ("p2", "p1") 
   PARTITION ("p2" INT);

 

We represent a social network with the many-to-many relation "knows". The persons are identified by integers.

INSERT INTO "knows" VALUES (1, 2);
INSERT INTO "knows" VALUES (1, 3);
INSERT INTO "knows" VALUES (2, 4);
 
SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2" 
         FROM "knows"
        ) "k" 
   WHERE "k"."p1" = 1;

We obtain the result:

p1 p2
1 3
1 2
1 4

The operation is reversible:

SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2" 
         FROM "knows"
        ) "k" 
   WHERE "k"."p2" = 4;

 
p1 p2
2 4
1 4

Since we now give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 directly and, because 2 is known by 1, by 1 transitively.
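
To make the semantics of the two queries concrete, here is a small Python sketch of what they compute. It is an illustration only, not how Virtuoso evaluates the TRANSITIVE clause; the function name and the choice to treat the starting node as already visited are assumptions:

```python
from collections import deque

knows = [(1, 2), (1, 3), (2, 4)]  # the rows inserted above

def transitive(edges, p1=None, p2=None):
    """Return (p1, p2) pairs linked by one or more 'knows' steps.

    Exactly one end is given; the walk starts from whichever end that is.
    """
    forward = p1 is not None
    start = p1 if forward else p2
    step = {}
    for a, b in edges:
        src, dst = (a, b) if forward else (b, a)
        step.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in step.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    reached = seen - {start}
    return {(start, n) if forward else (n, start) for n in reached}

from_1 = transitive(knows, p1=1)  # who does 1 know, directly or indirectly?
to_4 = transitive(knows, p2=4)    # who knows 4, directly or indirectly?
print(sorted(from_1), sorted(to_4))
```

The same edge list serves both directions; only the side from which the walk starts changes.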

To see what would happen if x knowing y also meant y knowing x, one could write:

SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2" 
         FROM (SELECT
                  "p1", 
                  "p2" 
               FROM "knows" 
               UNION ALL 
                  SELECT 
                     "p2", 
                     "p1" 
                  FROM "knows"
              ) "k2"
        ) "k" 
   WHERE "k"."p2" = 4;
 
p1 p2
2 4
1 4
3 4

Now, since we know that 1 and 4 are related, we can ask how they are related.

SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2", 
            T_STEP (1) AS "via", 
            T_STEP ('step_no') AS "step", 
            T_STEP ('path_id') AS "path" 
         FROM "knows"
        ) "k" 
   WHERE "p1" = 1 
      AND "p2" = 4;
 
p1 p2 via step path
1 4 1 0 0
1 4 2 1 0
1 4 4 2 0

The first two columns are the ends of the path. The third column is the person that is a step on the path. The fourth is the number of the step, counting from 0; step 0 is the end of the path fixed by the condition on the column designated as input, i.e., p1. Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.
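
The shape of this result can be reproduced with a short Python sketch. It is an assumption-laden illustration: the function name is made up, and it reports only one shortest path, so the path number is always 0, whereas the query above can enumerate alternatives.

```python
from collections import deque

knows = [(1, 2), (1, 3), (2, 4)]

def path_rows(edges, p1, p2):
    """Rows (p1, p2, via, step, path) for one shortest path from p1 to p2."""
    out = {}
    for a, b in edges:
        out.setdefault(a, []).append(b)
    parent = {p1: None}          # also serves as the visited set
    queue = deque([p1])
    while queue:
        node = queue.popleft()
        if node == p2:
            break
        for nxt in out.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if p2 not in parent:
        return []
    path, node = [], p2
    while node is not None:      # walk back from p2 to p1
        path.append(node)
        node = parent[node]
    path.reverse()
    return [(p1, p2, via, step, 0) for step, via in enumerate(path)]

rows = path_rows(knows, 1, 4)
for row in rows:
    print(*row)
```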

For LinkedIn users: the query that lists contacts ordered by distance and then by descending contact count, which underlies most LinkedIn search result views, can be written as:

SELECT "p2", 
      "dist", 
      (SELECT 
          COUNT (*) 
          FROM "knows" "c" 
          WHERE "c"."p1" = "k"."p2"
      ) 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               "p1", 
            "p2", 
            T_STEP ('step_no') AS "dist"
         FROM "knows"
        ) "k" 
   WHERE "p1" = 1 
   ORDER BY "dist", 3 DESC;
 
p2 dist aggregate
2 1 1
3 1 0
4 2 0
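
The same ordering can be checked with a few lines of Python over the sample rows. This is a sketch of the result's meaning under made-up names, not of the query plan:

```python
from collections import deque

knows = [(1, 2), (1, 3), (2, 4)]

def contacts_by_distance(edges, start):
    """(person, distance, contact count) ordered by distance, then count desc."""
    out = {}
    for a, b in edges:
        out.setdefault(a, []).append(b)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in out.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    rows = [(p, d, len(out.get(p, []))) for p, d in dist.items() if p != start]
    return sorted(rows, key=lambda r: (r[1], -r[2]))

ranked = contacts_by_distance(knows, 1)
for row in ranked:
    print(*row)
```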

How?

The queries shown above work on Virtuoso v6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.

Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.
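
The arithmetic and the idea of meeting in the middle can be sketched as follows. This is an illustrative Python version under stated assumptions (symmetric contacts held in an in-memory dictionary, made-up function names); it says nothing about how the Virtuoso optimizer implements the strategy.

```python
def expand(frontier, seen, other, neighbors):
    """Grow one side by one level; return (new frontier, path length or None)."""
    new_frontier, hit = [], None
    for node in frontier:
        for nb in neighbors.get(node, ()):
            if nb in seen:
                continue
            seen[nb] = seen[node] + 1
            if nb in other:  # the two searches have met
                total = seen[nb] + other[nb]
                hit = total if hit is None else min(hit, total)
            new_frontier.append(nb)
    return new_frontier, hit

def bidirectional_distance(neighbors, a, b, max_depth=3):
    """Path length between a and b, expanding up to max_depth levels per end."""
    if a == b:
        return 0
    seen_a, seen_b = {a: 0}, {b: 0}
    front_a, front_b = [a], [b]
    for _ in range(max_depth):
        front_a, hit = expand(front_a, seen_a, seen_b, neighbors)
        if hit is not None:
            return hit
        front_b, hit = expand(front_b, seen_b, seen_a, neighbors)
        if hit is not None:
            return hit
    return None

# Hypothetical chain of seven people, 0 - 1 - ... - 6, with symmetric contacts.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
steps = bidirectional_distance(chain, 0, 6)

both_ends = 2 * 30 ** 3  # nodes reached when expanding 3 levels from each end
one_end = 30 ** 6        # nodes reached when expanding 6 levels from one end
print(steps, both_ends, one_end)
```

With 30 contacts per person, the two-ended search touches about 13,500 times fewer nodes than the one-ended search.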

Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS.

Next

In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.

09/08/2008 09:20 GMT Modified: 09/08/2008 15:43 GMT
Linked Data and Information Architecture

We had a workshop on Linked Open Data (LOD) last week in Beijing. You can see the papers in the program. The event was a success with plenty of good talks and animated conversation. I will not go into every paper here but will comment a little on the conversation and draw some technology requirements going forward.

Tim Berners-Lee showed a read-write version of Tabulator. This raises the question of updating on the Data Web. The consensus was that one could assert what one wanted in one's own space but that others' spaces would be read-only. What spaces one considered relevant would be the user's or developer's business, as in the document web.

It seems to me that a significant use case of LOD is an open-web situation where the user picks a broad read-only "data wallpaper" or backdrop of assertions, and then uses this combined with a much smaller, local, writable data set. This is certainly the case when editing data for publishing, as in Tim's demo. This will also be the case when developing mesh-ups combining multiple distinct data sets bound together by sets of SameAs assertions, for example. Questions like, "What is the minimum subset of n data sets needed for deriving the result?" will be common. This will also be the case in applications using proprietary data combined with open data.

This means that databases will have to deal with queries that specify large lists of included graphs, all graphs in the store or all graphs with an exclusion list. All this is quite possible but again should be considered when architecting systems for an open linked data web.
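
A toy illustration of the three ways of scoping a query, in Python over an in-memory list of quads: the graph names, the data, and the function names are all hypothetical, and a real store would resolve the scope through its indices rather than by scanning.

```python
# Hypothetical quads (subject, predicate, object, graph); names are illustrative.
quads = [
    ("ex:alice", "foaf:knows", "ex:bob", "g:backdrop1"),
    ("ex:bob", "foaf:knows", "ex:carol", "g:backdrop2"),
    ("ex:alice", "foaf:knows", "ex:dave", "g:local"),
    ("ex:alice", "foaf:knows", "ex:eve", "g:untrusted"),
]

def in_scope(graph, include=None, exclude=frozenset()):
    """True if the graph belongs to the query's data set.

    include=None means all graphs in the store; exclude is applied on top.
    """
    return (include is None or graph in include) and graph not in exclude

def known_by(person, include=None, exclude=frozenset()):
    return sorted(o for s, p, o, g in quads
                  if s == person and p == "foaf:knows"
                  and in_scope(g, include, exclude))

listed = known_by("ex:alice", include={"g:backdrop1", "g:local"})
all_but = known_by("ex:alice", exclude={"g:untrusted"})
everything = known_by("ex:alice")
print(listed, all_but, everything)
```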

"There is data but what can we really do with it? How far can we trust it, and what can we confidently decide based on it?"

As an answer to this question, Zitgist has compiled the UMBEL taxonomy using SKOS. This draws on Wikipedia, OpenCyc, WordNet, and YAGO, hence the acronym WOWY. UMBEL is both a taxonomy and a set of instance data, containing a large set of named entities, including persons, organizations, geopolitical entities, and so forth. By extracting references to this set of named entities from documents and correlating this to the taxonomy, one gets a good idea of what a document (or part thereof) is about.

Kingsley presented this in the Zitgist demo. This is our answer to the criticism about DBpedia having errors in classification. DBpedia, as a bootstrap stage, is about giving names to all things. Subsequent efforts like UMBEL are about refining the relationships.

"Should there be a global URI dictionary?"

There was a talk by Paolo Bouquet about Entity Name System, a sort of data DNS, with the purpose of associating some description and rough classification to URIs. This would allow discovering URIs for reuse. I'd say that this is good if it can cut down on the SameAs proliferation and if this can be widely distributed and replicated for resilience, à la DNS. On the other hand, it was pointed out that this was not quite in the LOD spirit, where parties would mint their own dereferenceable URIs, in their own domains. We'll see.

"What to do when identity expires?"

Giovanni of Sindice said that a document should be removed from search if it was no longer available. Kingsley pointed out that resilience of reference requires some way to recover data. The data web cannot be less resilient than the document web, and there is a point to having access to history. He recommended hooking up with the Internet Archive, since they make long term persistence their business. In this way, if an application depends on data, and the URIs on which it depends are no longer dereferenceable or provide content from a new owner of the domain, those who need the old version can still get it and host it themselves.

It is increasingly clear that OWL SameAs is both the blessing and bane of linked data. We can easily have tens of URIs for the same thing, especially with people. Still, these should be considered the same.

Returning every synonym in a query answer hardly makes sense but accepting them as input seems almost necessary. This is what we do with Virtuoso's SameAs support. Even so, this can easily double query times even when there are no synonyms.
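
The idea of accepting any synonym as input can be sketched in Python as follows. The URIs, triples, and function names are hypothetical, and this is not a description of how Virtuoso's SameAs support is implemented:

```python
# Hypothetical sameAs links and triples; none of these URIs are real.
same_as = [("ex:berlin", "other:city42"), ("other:city42", "third:BerlinDE")]
triples = [
    ("ex:berlin", "ex:population", "3400000"),
    ("third:BerlinDE", "ex:country", "ex:germany"),
    ("ex:paris", "ex:country", "ex:france"),
]

def synonyms(uri, links):
    """All URIs connected to uri by sameAs links, followed in either direction."""
    adjacent = {}
    for a, b in links:
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)
    found, todo = {uri}, [uri]
    while todo:
        for nb in adjacent.get(todo.pop(), ()):
            if nb not in found:
                found.add(nb)
                todo.append(nb)
    return found

def objects_of(subject, predicate):
    """Match the subject under any of its synonyms; return the objects only."""
    names = synonyms(subject, same_as)
    return sorted({o for s, p, o in triples if s in names and p == predicate})

country = objects_of("ex:berlin", "ex:country")
print(country)
```

The synonym lookup runs even for a subject that has no sameAs links at all, which gives a rough intuition for why such support costs something on every query.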

Be that as it may, SameAs is here to stay; just consider the mapping of DBpedia to Geonames, for example.

Also, making aberrant SameAs statements can completely poison a data set and lead to absurd query results. Hence choosing which SameAs assertions from which source will be considered seems necessary. In an open web scenario, this leads inevitably to multi-graph queries that can be complex to write with regular SPARQL. By extension, it seems that a good query would also include the graphs actually used for deriving each result row. This is of course possible but has some implications on how databases should be organized.

Yves Raimond gave a talk about deriving identity between Musicbrainz and Jamendo. I see the issue as a core question of linked data in general. The algorithm Yves presented started with attribute value similarities and then followed related entities. Artists would be the same if they had similar names and similar names of albums with similar song titles, for example. We can find the same basic question in any analysis, for example, looking at how news reporting differs between media, supposing there is adequate entity extraction.
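
As a rough sketch of this two-stage idea, and explicitly not the algorithm from the talk, one can compare names first and then check whether the related albums also line up. The records, thresholds, and function names below are all assumptions chosen for illustration; string similarity comes from difflib in the Python standard library.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Case- and whitespace-insensitive string similarity test."""
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def same_artist(x, y, min_overlap=0.5):
    """Similar names first; then enough of x's albums must have a match in y."""
    if not similar(x["name"], y["name"]):
        return False
    if not x["albums"]:
        return False
    matched = sum(1 for a in x["albums"]
                  if any(similar(a, b) for b in y["albums"]))
    return matched / len(x["albums"]) >= min_overlap

# Hypothetical records standing in for entries from two data sets.
rec_a = {"name": "The Example Band", "albums": ["First Light", "Second Wind"]}
rec_b = {"name": "the example band ", "albums": ["first light", "Live Set"]}
rec_c = {"name": "Another Group", "albums": ["First Light"]}

print(same_artist(rec_a, rec_b), same_artist(rec_a, rec_c))
```

A fuller version would descend one more level and compare song titles within the matched albums, as the talk described.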

There is basic graph diffing in RDFSync, for example. But here we are expanding the context significantly. We will traverse references to some depth, allow similarity matches, SameAs, and so forth. Having presumed identity of two URIs, we can then look at the difference in their environment to produce a human readable summary. This could then be evaluated for purposes of analysis or of combining content.
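
A depth-one version of such an environment diff might look like the sketch below. It presumes the two URIs are identical, maps objects through a table of already accepted identities, and splits the outgoing statements into shared and source-specific ones. The function names and the canonical mapping are hypothetical; traversing deeper and allowing similarity matches, as described above, is left out.

```python
def describe(triples, uri, canonical):
    """Outgoing (predicate, object) pairs of uri, objects mapped to canonical IDs."""
    return {(p, canonical.get(o, o)) for s, p, o in triples if s == uri}

def environment_diff(triples_a, uri_a, triples_b, uri_b, canonical=None):
    """Summarize how two presumed-identical URIs differ in their surroundings."""
    canonical = canonical or {}
    env_a = describe(triples_a, uri_a, canonical)
    env_b = describe(triples_b, uri_b, canonical)
    return {
        "shared": sorted(env_a & env_b),
        "only_a": sorted(env_a - env_b),
        "only_b": sorted(env_b - env_a),
    }
```

The three lists are the raw material for a human readable summary: what both sources agree on, and what each one says that the other does not.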

At first sight, these algorithms seem well parallelizable, as long as all threads have access to all data. For scaling, this probably means a message-bound distributed algorithm. This is something to look into for the next stage of linked data.

Some inference is needed, but if everybody has their own choice of data sets to query, then everybody would also have their own entailed triples. This will make for an explosion of entailed graphs if forward chaining is used. Forward chaining is very nice because it keeps queries simple and easy to optimize. With Virtuoso, we still favor backward chaining since we expect a great diversity of graph combinations and near infinite volume in the open web scenario. With private repositories of slowly changing data put together for a special application, the situation is different.
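
The trade-off can be shown with a toy subclass hierarchy. The sketch below is an assumption-laden miniature, not Virtuoso's inference machinery: forward chaining materializes every entailed (instance, class) pair ahead of time, while backward chaining expands the query at run time and stores nothing extra.

```python
def forward_chain(types, subclass_of):
    """Materialize every entailed (instance, class) pair up front."""
    entailed = set(types)
    changed = True
    while changed:
        changed = False
        for inst, cls in list(entailed):
            for sub, sup in subclass_of:
                if sub == cls and (inst, sup) not in entailed:
                    entailed.add((inst, sup))
                    changed = True
    return entailed

def backward_chain(types, subclass_of, wanted_class):
    """Find instances of wanted_class at query time, storing nothing extra."""
    accepted = {wanted_class}
    frontier = [wanted_class]
    while frontier:
        cls = frontier.pop()
        for sub, sup in subclass_of:
            if sup == cls and sub not in accepted:
                accepted.add(sub)
                frontier.append(sub)
    return {inst for inst, cls in types if cls in accepted}
```

The forward-chained set has to be recomputed and stored for every combination of data sets a user might pick, whereas the backward-chained query only pays for the hierarchy it actually touches.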

In conclusion, we have a real LOD movement with actual momentum and a good idea of what to do next. The next step is promoting this to the broader community, starting with Linked Data Planet in New York in June.

04/29/2008 12:08 GMT Modified: 04/29/2008 17:18 GMT
RDF Benchmarking, Role, Motives, and Rationale

Arising from the recent W3C workshop on mapping relational data to RDF, there is some discussion on starting a benchmarking oriented experimental group under the W3C. I'll here make some comments on where this might fit and how this might serve our nascent industry.

To the public, basically any recipient of the semantic data web message, the benchmarking activity should communicate:

  • The semantic data web claims to

    1. allow integrating any legacy data from wherever and allow translating this into common, mutually joinable vocabularies, and
    2. make the web into a big database capable of answering structured queries on any open data.
  • The benchmarking activity is to prove that this is not a pipe dream that Gartner Group forecast for 2027. Instead, there exists

    1. an industry,
    2. a degree of consensus within the industry concerning what the semantic data web is for, and
    3. products that are beyond experimental and can deliver at least some of the claimed benefits of the semantic data web.

To the general public, the message will be best delivered by the existence of online services that do interesting things with linked data, starting from search and going to more specialized derivative products of structured information on the web.

To those intending to apply some semantic data web things themselves, the benchmark activity should give a directory of products to look at. The reason why a benchmark suite backed by some industry consortium is useful is that it adds to the end user's confidence that the use case being measured is of somewhat general relevance and not just made to demonstrate any single product's strengths. Besides this, the TPC idea of disclosing scale, throughput, price per throughput and date is fine because it makes for easy tabulation of results. The intricacies in the full disclosure are effectively masked and it is my guess that very few read the actual full disclosures.

The inference that an evaluator draws from benchmark results is that some product figuring there consistently is somewhat serious and can be studied further. Being in the running is like a stamp of approval. The benchmarks are complex and the evaluator seldom goes to the trouble of really analyzing performance by individual query or transaction even if these are and must be given. It is a bit like how Formula 1 viewers do not generally read the rules on car engines or aerodynamics, let alone understand their finer points.

For credibility to be thus given to products and hence the industry, we should just have a couple of well defined and agreed upon benchmarks, just like TPC.

The third public is the developer. As a DBMS developer, I am a great fan of TPC. The great benefit I derive from their work is that they give a test suite for measuring effects of code changes on performance. Also, assuming that the TPC workload mix is representative, it allows ranking which optimizations are more important than others. Lastly, TPC gives a great way of describing results, e.g., changes resulting in x% improvement on throughput of y. In such usage, the benchmarks are pretty much never run by the rules but results obtained are still good for internal comparison.

Communication about IS should allow for short, simple messages: Release XX Halves Price per Throughput.
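
The arithmetic behind such a headline is simple enough to sketch. The figures in the usage note are invented purely for illustration and do not come from any published result; the function names are likewise hypothetical.

```python
def price_per_throughput(total_price, throughput):
    """Total system price divided by measured throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return total_price / throughput

def relative_change(old, new):
    """Fractional change from old to new; negative means an improvement in cost."""
    return (new - old) / old
```

For instance, if a hypothetical system priced at 100000 goes from 5000 to 10000 transactions per minute between releases, price_per_throughput drops from 20 to 10 and relative_change reports -0.5, which is the "halves price per throughput" message.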

The existence of benchmarks is, if not absolutely necessary, then at least a great help for such communication. Besides, people are culturally used to all kinds of racing and sports results so this is even a familiar format.

Now the TPC is also not perfect. In the high end, the measured configurations are so large that one does not see them very often in practice. It is like the techno sports of Formula 1 or America's Cup. Interesting for the curiosity value but not immediately relevant to the regular car buyer or weekend yachtsman. Further, sponsoring a by-the-book audited TPC result is not so simple. Not as expensive as putting out an America's Cup challenge but still some trouble and expense.

So, for us to benefit by the benchmarking activity, we must find a group that can both agree and be somewhat representative. Then we must put out a simple message: This here is for integration of relational sources and this here for storage and query of RDF.

Furthermore, insofar as we derive from relational or similar sources, the technology should not do less than the established alternative. Doing less sends the wrong message.

Entering the running should not be overly difficult for vendors, hence we should not have too many benchmarks and the ones that there are should be representative and sufficiently varied workloads. The results should be compact and easy to state. One more reason why I like TPC's work is the fact that the benchmarks have an easy to understand, unified use case behind them. Approximately what is done in each becomes clear from a very short and succinct description even though the details can be complex. I suspect this is one side of their appeal. I would venture the guess that a single use case story is easier to sell than a composite metric of disparate tests. Also in the scientific computing world, we have use cases, like NAS for aerodynamics, so having a use case story is quite common and a factor for making a benchmark's relevance understandable.

Is this all possible?

To play the devil's advocate, I could say that the use cases are not as well settled as the relational ones, hence formulating a generally representative benchmark is not possible. Now this is certainly not a message that this community wishes to send. Besides, there exists decades' worth of history of the problems of information integration and a great deal of RDF data out there, even a compilation of dozens of industry use cases by the SWEO, so we are not exactly in the dark here.

Can there be political agreement in reasonable time? If we look at the TPC as a precedent, judging by the rate of publication and revision, the process is not exactly quick. Now, for the TPC, it does not have to be. Judging by the frequency of published test results, hardware vendors are happy enough to have a forum to show off and do so at every turn.

Now we are not at this stage of maturity yet.

Composing a TPC style test spec is possible in a reasonable time for an individual but likely not for a committee. It is quite voluminous but also quite formulaic. While TPC's material is their own, I see no reason that we could not reference or link to it where applicable.

Who would be motivated by such activity? How to pitch the activity to would be participants? I don't think that just talking about what to measure and how is interesting enough. This is covered ground. Vendors want to promote themselves and end users want to have vendors compete at solving their problems. Or so it would be in a simpler world.

Personally, I'd like to see a benchmark with a use case story people can relate to emerge in the next few months. Now I am not necessarily holding my breath waiting for this. For purposes of ongoing development, there is the real data out there, and we can, for example, run the social web workload mix I suggested a couple of blog posts back on that data; this is good enough for us. But that is not good enough for the industry's messaging.

I'd say that we have to assume that people play in good faith and simply ask who wants to run and get an extra edge by being in on the design of the race track. By good faith I here mean a sincere wish to have the race take place in the first place.

The sport is exciting for the players and spectators alike if there is a use case story that they can relate to and an actual tournament. So this is what we should aim for. Because this is so far a niche public, we should not fragment the activity too much and we should consider how understandable and relevant the benchmark activity is to likely semantic data web adopters.

11/21/2007 14:19 GMT Modified: 04/30/2008 14:28 GMT
On RDF and Vertical Storage

The topic of column-wise storage has not escaped us. We are not convinced that this is good for RDF. There is a point to this for business intelligence data warehouses, no doubt, although one could argue that one could get the same IO benefit with suitably selected covering indices, but this is more design work. Column storage fits in less space and is more versatile for unexpected workloads.

But we can look at the RDF case specifically. You have a quad of G, S, P, O. You have a one-part index on each and you have a unique row number for each quad. Given the row number, you must get the G, S, P, and O, and given any one of these, you must get the row numbers where this occurs. If there were multi-part keys, then this would be a row store with covering indices, like Virtuoso's RDF store.

Each datum is stored 8 times. What is nice is that one can use any combination of selection criteria with equal ease and in the same working set. With the RDF workload, you end up typically referencing all parts of each quad. It is not like in the business intelligence case where the typical query accesses 4 columns of the 15 column history table. Of the 4 RDF quad keys, at least 2 are generally given. So this becomes a merge intersection of two or three indices and random lookups for the unspecified columns. Complicated control path, even if the engine is meant to do this thing alone.

We'll have to try this. We could set up Virtuoso with 4 bitmap indices, each column to row ID and then a table with the 4 columns. Then we'd get bitmap ANDs for multi-column criteria and would have to get the row by row ID. As long as we run in memory, this should perform like a column store, close enough. We get the row with all the columns once, so we compensate for the fact that a column store has a special means for dereferencing the row ID for any column.
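
The layout is easy to mock up in memory. The sketch below is only a toy, not a Virtuoso configuration: Python integers stand in for the bitmaps, a list stands in for the row table, and the class and method names are made up. It shows the two steps described above, a bitmap AND across the single-column indices followed by fetching whole rows by row ID.

```python
class QuadStore:
    """Row table plus one value -> row-ID bitmap index per column."""

    COLUMNS = ("g", "s", "p", "o")

    def __init__(self):
        self.rows = []                               # row ID -> (g, s, p, o)
        self.index = {c: {} for c in self.COLUMNS}   # column -> value -> bitmap

    def add(self, g, s, p, o):
        row_id = len(self.rows)
        quad = (g, s, p, o)
        self.rows.append(quad)
        for column, value in zip(self.COLUMNS, quad):
            bitmap = self.index[column].get(value, 0)
            self.index[column][value] = bitmap | (1 << row_id)
        return row_id

    def match(self, **criteria):
        """AND the bitmaps of all given columns, then fetch rows by row ID."""
        if not criteria:
            return list(self.rows)
        bitmap = -1  # all bits set; the first AND narrows it down
        for column, value in criteria.items():
            bitmap &= self.index[column].get(value, 0)
        result = []
        row_id = 0
        while bitmap:
            if bitmap & 1:
                result.append(self.rows[row_id])  # one fetch yields all 4 columns
            bitmap >>= 1
            row_id += 1
        return result
```

Any combination of G, S, P, and O can be given as keyword arguments to match, which mirrors the point that a column-wise layout handles every selection pattern with equal ease; whether it beats multi-part covering indices on real RDF workloads is the open question.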

If we optimized this specially, which would not be so terribly hard, we'd have a column store. The main new thing would be making a special index by row ID that would have the ID just once per index leaf and a bitmap for dense allocation of row IDs. The rest is not too different.

For now, we will watch. If this is the next big thing, we can get there in little time.

05/23/2007 14:08 GMT Modified: 04/24/2008 09:52 GMT
         