Details

Virtuso Data Space Bot
Burlington, United States

VLDB 2009 (1 of 5)

I was at the VLDB 2009 conference in Lyon, France. Over the next few posts, I will discuss some of the prominent themes and how they relate to our products or to RDF and Linked Data.

Firstly, RDF was as good as absent from the presentations and discussions we saw. There were a few mentions in the panel on structured data on the web; however, RDF was not seen as in any way essential to this. There were also a couple of RDF mentions in questions at other sessions, but that was about it.

It is a common perception that RDF and database people do not talk with each other. The evidence seems to bear this out.

As a database developer I did get a lot of readily applicable ideas from the VLDB talks. These run across the whole range of DBMS topics, from key compression and SQL optimization, to column storage, CPU cache optimization, and the like. In this sense, VLDB is directly relevant to all we do. In a conversation, someone was mildly confused that I should on one hand mention I was doing RDF, and on the other hand also be concerned about database performance. These things are not seen to belong together, even though making RDF do something useful certainly depends on a great deal of database optimization.

The question of all questions — that of infinite scale-out with complex queries, resilience, replication, and full database semantics — was strongly in the air.

But it was in the air more as a question than as an answer. Very little was said about the performance of distributed query plans, of 2PC (two-phase commit), of the impact of interconnect latency, and such things. On the other hand, people were talking quite liberally about optimizing CPU cache and local multi-core execution, not to mention SQL plans and rewrites. Also, almost nothing was said about transactions.

Still, there is bound to be a great deal of work in scale-out of complex workloads by any number of players. Either these things are all figured out and considered self-evidently trivial, or they are so hot that people will go there only by way of allusion and vague reference. I think it is the latter.

By and large, we were confirmed in our understanding that infinite scale-out on the go, with redundancy, is the ticket, especially if one can offer complex queries and transactional semantics coupled with instant data loading and schema-last.

Column storage and cache optimizations seem to come right after these.

Certainly the database space is diversifying.

MapReduce was discussed quite a bit, as an intruder into what would be the database turf. We have no great problem with MapReduce; we do that in SQL procedures if one likes to program in this way. Greenplum also seems to have come by the same idea.

As said before, RDF and RDF reasoning were ignored. Do these actually offer something to the database side? Certainly for search, integration, and resource discovery, linked data has evident advantages.

Two points of the design space — the warehouse, and the web-scale key-value store — got a lot of attention. Would I do either in RDF? RDF is a slightly different design space point, like key-value with complex queries — on the surface, a fusion of the two. As opposed to RDF, the relational warehouse gains from fixed data-types and task-specific layout, whether row or column. The key-value store gains from having a concept of a semi-structured record, a bit like the RDF subject of a triple, but now with ad-hoc (if any) secondary indices, and inline blobs. The latter is much simpler and more compact than the generic RDF subject with graphs and all, and can be easily treated as a unit of version control and replication mastering. RDF, being more generic and more normalized, is representationally neither as ad-hoc nor as compact.

But RDF will be the natural choice when complex queries and ad-hoc schema meet, for example in web-wide integrations of application data.
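To make the comparison between a key-value record and an RDF subject concrete, here is a minimal sketch. The record layout, the `user:1` identifier, and the function names are all made up for illustration and do not reflect how any particular store, Virtuoso included, lays out its data. It shows the same information once as a compact record and once as normalized triples.

```python
from collections import defaultdict

def record_to_triples(subject, record):
    """Flatten one key-value record into (subject, predicate, object) triples."""
    triples = []
    for key, value in record.items():
        # A multi-valued field becomes one triple per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((subject, key, v))
    return triples

def triples_to_records(triples):
    """Regroup triples by subject; each predicate maps to a list of objects."""
    records = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        records[s][p].append(o)
    return {s: dict(props) for s, props in records.items()}

# Hypothetical record: one unit of storage in a key-value store.
user = {"name": "Alice", "city": "Lyon", "tags": ["rdf", "sql"]}
triples = record_to_triples("user:1", user)
records = triples_to_records(triples)
```

The record is a single unit that can be versioned and replicated as a whole, whereas the triple form spreads it over four rows. In exchange, the triple form can be queried on any predicate without the record having declared it in advance.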

There seems to be a huge divide in understanding between database-developing people and those who would be using databases. On one side, this has led to a back-to-basics movement with no SQL, no ACID, key-value pairs instead of schema, MapReduce instead of fancy but hard-to-follow parallel execution plans. On the other side, the database space specializes more and more; it is no longer simply transactions vs. analytics, but many more points of specialization.

Some frustration can be sensed in the ivory towers of science when it is seen that the ones most in need of database understanding in fact have the least. Google, Yahoo!, and Microsoft know what they are doing, with or without SQL, but the medium-size or fast-growing web sites seem to be in confusion when LAMP or Ruby or the scripting-du-jour can no longer cut it.

Can somebody using a database be expected to understand how it works? I would say no, not in general. Can a database be expected to unerringly self-configure based on workload? Sure, a database can suggest layouts, but it ought not restructure itself on the spur of the moment under full load.

It is safe to say that the community at large no longer believes in "one size fits all". Since there is no general solution, there is a fragmented space of specific solutions. We will be looking at some of these issues in the following posts.

09/01/2009 11:30 GMT-0500 Modified: 09/01/2009 16:53 GMT-0500
Faceted Search: Unlimited Data in Interactive Time

Why not see the whole world of data as facets? Well, we'd like to, but there is the feeling that this is not practical.

The old problem has been twofold. It is not really practical to pre-compute counts of everything for all possible combinations of search conditions and counting/grouping/sorting, and finding the actual matches at query time takes time.

Well, neither is in fact necessary. When there are large numbers of items matching the conditions, counting them can take time, but this is only the beginning of the search, and the user is not likely to look very closely at the counts. It is enough to see that there are many of one and few of another. If the user already knows the precise predicate or class to look for, then the top-level faceted view is not even needed. The faceted view for guiding search and precise analytics are two different problems.

There are client-side faceted views like Exhibit or our own ODE. The problem with these is that there are a few orders of magnitude difference between the actual database size and what fits on the user agent. This is compounded by the fact that one does not know what to cache on the user agent because of the open nature of the data web. If this were about a fixed workflow, then a good guess would be possible — but we are talking about the data web, the very soul of serendipity and unexpected discovery.

So we made a web service that will do faceted search on arbitrary RDF. If it does not get complete results within a timeout, it will return what it has counted so far, using Virtuoso's Anytime feature. Looking only for subjects with some specific combination of properties is, however, a bit limited, so the service will also do JOINs. Many features are one or two JOINs away; take geographical locations or social networks, for example.
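The idea of returning whatever has been counted when the timeout expires can be sketched in a few lines. This is only an illustration of the principle over an in-memory list, not how Virtuoso's Anytime feature is implemented; the function name, its parameters, and the sample data are assumptions.

```python
import time
from collections import Counter

def anytime_facet_counts(items, facet_of, timeout_s):
    """Count facet values until the time budget runs out.

    Returns (counts, complete). When complete is False, the deadline
    was reached before every item was seen, so the counts are lower
    bounds rather than exact totals.
    """
    deadline = time.monotonic() + timeout_s
    counts = Counter()
    for item in items:
        if time.monotonic() >= deadline:
            return counts, False  # partial answer, flagged as such
        counts[facet_of(item)] += 1
    return counts, True

# Hypothetical data: three resources with a "type" facet.
docs = [{"type": "Book"}, {"type": "Person"}, {"type": "Book"}]
counts, complete = anytime_facet_counts(docs, lambda d: d["type"], timeout_s=1.0)
partial, partial_complete = anytime_facet_counts(docs, lambda d: d["type"], timeout_s=0)
```

The caller always gets an answer within roughly the budget, together with a flag saying whether the counts are exact. For guiding a search, a partial count that shows many of one and few of another is usually enough.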

Yet a faceted search should be point-and-click, and should not involve a full query construction. We put the compromise at starting with full text or property or class, then navigating down properties or classes, to arbitrary depth, tree-wise. At each step, one can see the matching instances or their classes or properties, all with counts, faceted-style.

This is good enough for queries like 'what do Harry Potter fans also like' or 'who are the authors of articles tagged semantic web and machine learning and published in 2008'. For complex grouping, sub-queries, arithmetic or such, one must write the actual query.

But one can begin with facets, and then continue refining the query by hand since the service also returns SPARQL text. We made a small web interface on top of the service with all logic server side. This proves that the web service is usable and that an interface with no AJAX, and no problems with browser interoperability or such, is possible and easy. Also, the problem of syncing between a user-agent-based store and a database is entirely gone.

If we are working with a known data structure, the user interface should choose the display by the data type and offer links to related reports. This is all easy to build as web pages or AJAX. We show how the generic interface is done in Virtuoso PL, and you can adapt that or rewrite it in PHP, Java, JavaScript, or anything else, to accommodate use-case specific navigation needs such as data format.

The web service takes an XML representation of the search, which is more restricted and easier to process by machine than the SPARQL syntax. The web service returns the results, the SPARQL query it generated, whether the results are complete or not, and some resource use statistics.
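As a rough sketch of why a tree-shaped XML description is easier to process than SPARQL text, the following turns a small XML tree into a SPARQL query. The element names (`query`, `class`, `property`, `text`) and the `iri` attribute are invented for this example and are not the actual input format of the Virtuoso service. The full-text condition uses the standard SPARQL `CONTAINS` function rather than any Virtuoso-specific text index.

```python
import xml.etree.ElementTree as ET

def facet_xml_to_sparql(xml_text):
    """Translate a hypothetical facet-tree XML document into SPARQL text."""
    root = ET.fromstring(xml_text)
    patterns = []
    counter = [0]

    def fresh():
        counter[0] += 1
        return counter[0]

    def walk(node, var):
        for child in node:
            if child.tag == "class":
                patterns.append(f"{var} a <{child.get('iri')}> .")
            elif child.tag == "property":
                obj = f"?o{fresh()}"
                patterns.append(f"{var} <{child.get('iri')}> {obj} .")
                walk(child, obj)  # nested conditions constrain the object
            elif child.tag == "text":
                # Any property of var whose value contains the given text.
                n = fresh()
                text = (child.text or "").replace("\\", "\\\\").replace('"', '\\"')
                patterns.append(f"{var} ?p{n} ?o{n} .")
                patterns.append(f'FILTER(CONTAINS(STR(?o{n}), "{text}"))')

    walk(root, "?s")
    body = "\n  ".join(patterns)
    return f"SELECT DISTINCT ?s WHERE {{\n  {body}\n}}"

# Hypothetical request: persons who know someone described with "Alice".
example = """
<query>
  <class iri="http://xmlns.com/foaf/0.1/Person"/>
  <property iri="http://xmlns.com/foaf/0.1/knows">
    <text>Alice</text>
  </property>
</query>
"""
sparql = facet_xml_to_sparql(example)
print(sparql)
```

Each navigation step in the interface corresponds to adding one child element, and the nesting depth is the number of JOINs. The generated text can then be handed back to the user for refinement by hand.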

The source of the PL functions, Web Service and Virtuoso Server Page (HTML UI) will be available as part of Virtuoso 6.0 and higher. A Programmer's Guide will be available as part of the standard Virtuoso Documentation collection, including the Virtuoso Open Source Edition Website.

01/09/2009 22:03 GMT-0500 Modified: 01/09/2009 17:15 GMT-0500
         