Details

Virtuso Data Space Bot
Burlington, United States

Subscribe

Post Categories

Recent Articles

Display Settings

articles per page.
order.
Showing posts in all categories RefreshRefresh
Some Interesting VLDB 2009 Papers (2 of 5)

Intel on Hash Join

Intel and Oracle had measured hash and sort merge joins on Intel Core i7. The result was that hash join with both tables partitioned to match CPU cache was still the best but that sort/merge would catch up with more SIMD instructions in the future.

We should probably experiment with this but the most important partitioning of hash joins is still between cluster nodes. Within the process, we will see. The tradeoff of doing all in cache-sized partitions is larger intermediate results which in turn will impact the working set of disk pages in RAM. For one-off queries this is OK; for online use this has an effect.

1000 TABLE Queries

SAP presented a paper about federating relational databases. Queries would be expressed against VIEWs defined over remote TABLEs, UNIONed together and so forth. Traditional methods of optimization would run out of memory; a single 1000 TABLE plan is already a big thing. Enumerating multiple variations of such is not possible in practice. So the solution was to plan in two stages — first arrange the subqueries and derived TABLEs, and then do the JOIN orders locally. Further, local JOIN orders could even be adjusted at run time based on the actual data. Nice.

Oracle Subqueries and New Implementation of LOBs

Oracle presented some new SQL optimizations, combining and inlining subqueries and derived TABLEs. We do fairly similar things and might extend the repertoire of tricks in the direction outlined by Oracle as and when the need presents itself. This further confirms that SQL and other query optimization is really an incremental collection of specially recognized patterns. We still have not found any other way of doing it.

Another interesting piece by Oracle was about their re-implementation of large object support, where they compared LOB loading to file system and raw device speeds.

Amadeus CRS booking system, steady query time for arbitrary single table queries

There was a paper about a memory-resident database that could give steady time for any kind of single-table scan query. The innovation was to not use indices, but to have one partition of the table per processor core, all in memory. Then each core would have exactly two cursors — one reading, the other writing. The write cursor should keep ahead of the read cursor. Like this, there would be no read/write contention on pages, no locking, no multiple threads splitting a tree at different points, none of the complexity of a multithreaded database engine. Then, when the cursor would hit a row, it would look at the set of queries or updates and add the result to the output if there was a result. The data indexes the queries, not the other way around. We have done something similar for detecting changes in a full text corpus but never thought of doing queries this way.

Well, we are all about JOINs so this is not for us, but it deserves a mention for being original and clever. And indeed, anything one can ask about a table will likely be served with great predictability.

Greenplum

Google's chief economist said that the winning career choice would be to pick a scarce skill that made value from something that was plentiful. For the 2010s this career is that of the statistician/data analyst. We've said it before — the next web is analytics for all. The Greenplum talk was divided between the Fox use case, with 200TB of data about ads, web site traffic, and other things, growing 5TB a day. The message was that cubes and drill down are passé, that it is about complex statistical methods that have to run in the database, that the new kind of geek is the data geek, whose vocation it is to consume and spit out data, discover things in it, and so forth.

The technical part was about Greenplum, a SQL database running on a cluster with a PostgreSQL back-end. The interesting points were embedding MapReduce into SQL, and using relational tables for arrays and complex data types — pretty much what we also do. Greenplum emphasized scale-out and found column orientation more like a nice-to-have.

MonetDB, optimizing database for CPU cache

The MonetDB people from CWI in Amsterdam gave a 10 year best paper award talk about optimizing database for CPU cache. The key point was that if data is stored as columns, it ought also to be transferred as columns inside the execution engine. Materialize big chunks of state to cut down on interpretation overhead and use cache to best effect. They vector for CPU cache; we vector for scale-out, since the only way to ship operations is to ship many at a time. So we might as well vector also in single servers. This could be worth an experiment. Also we regularly visit the topic of column storage. But we are not yet convinced that it would be better than row-style covering indices for RDF quads. But something could certainly be tried, given time.

# PermaLink Comments [0]
09/01/2009 11:46 GMT-0500 Modified: 09/01/2009 17:32 GMT-0500
Web Scale and Fault Tolerance

One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related.

It has been said many times — when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers.

Approaches to Fault Tolerance

Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. Raids other than mirrored disk are not really good for databases because of write speed.

With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.

There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.

In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.

This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application's requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.

Basics of Partition Fail-Over

For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.

The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.

Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.

When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go via two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled — but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.

For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the spare copy and writes give an error. If the spare does not come back up in time, the system will automatically re-enable writes on the spare but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.

If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.

As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.

Levels of Tolerance and Performance

The only way to make failures totally invisible is to have all in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. This is why we can do better, using the spare capacity for more than standby.

Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.

The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.

Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory.

Background Bulk Processing

When loading data, the system is online in principle, but query response can be quite bad. A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server's partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.

But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.

Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.

This technique applies to all data intensive background tasks — calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.

Configurations of Redundancy

Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the two first failures are guaranteed not to interrupt operations; for the third, there is a 1/10 chance that it will.

We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers' RAM when in normal operation.

There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon's Dynamo does something similar at the database level. The analogies are not exact, though.

If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, there is a probability that the partitions formerly hosted by the failed server have their second copies randomly spread over the remaining servers. This scheme equalizes load better but is less resilient.

Maintenance and Continuity

Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.

Present Status

The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.

# PermaLink Comments [0]
04/01/2009 10:18 GMT-0500 Modified: 04/01/2009 11:18 GMT-0500
Virtuoso Update, Billion Triples and Outlook
Virtuoso Update, Billion Triples and Outlook

I will say a few things about what we have been doing and where we can go.

Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post.

There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.

Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, is all there and proven to work.

We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. We did along the way also run BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.

RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol.

In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM's primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web.

I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.

Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.

Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.

The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and "same as" enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.

We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.

We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.

We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.

Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.

The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.

Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable.

Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.

The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.

Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.

# PermaLink Comments [0]
10/02/2008 06:02 GMT-0500 Modified: 10/02/2008 12:47 GMT-0500
BSBM With Triples and Mapped Relational Data
BSBM With Triples and Mapped Relational Data

The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF.

Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts.

The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008.

But why an OLTP workload with RDF to begin with?

We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.

Warehousing all the world's publishable data as RDF is not our first preference, nor would it be the publisher's. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we'll do here.

What We Got

First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.

But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. And when there is a clear mapping, there is no reason the SPARQL could not be directly translated.

If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!

We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in "query mixes per hour". (An update or follow-on to this post will provide elapsed times for each test run.)

With the unmodified benchmark we got:

Physical Triples:     1297 qmph
Mapped Triples:     3144 qmph

In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)

The following were measured on the second run of a 100 query mix series, single test driver, warm cache.

Physical Triples:     5746 qmph
Mapped Triples:     7525 qmph

We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time.

Physical Triples:     19459 qmph
Mapped Triples:     24531 qmph

The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.

The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough.

We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.

Suggestions for BSBM

  • Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.

  • Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.

  • Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let's use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like "customers who bought this also bought xxx."

  • For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC.

  • Add transactions.Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server.

Comments on Query Mix

The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.

Next

We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.

# PermaLink Comments [0]
08/06/2008 15:41 GMT-0500 Modified: 08/06/2008 16:29 GMT-0500
Aspects of RDF to RDF Mapping
Aspects of RDF to RDF Mapping

The W3C has recently launched an incubator group about mapping relational data to RDF.

From participating in the group for the few initial sessions, I get the following impressions.

There is a segment of users, for example from the biomedical community, who do heavy duty data integration and look to RDF for managing complexity. Unifying heterogeneous data under OWL ontologies, reasoning, and data integrity, are points of interest.

There is another segment that is concerned with semantifying the document web, which topic includes initiatives such as Triplify and semantic web search such as Sindice. The emphasis there is on minimizing entry cost and creating critical mass. The next one to come will clean up the semantics, if these need be cleaned up at all.

(Some cleanup is taking place with Yago and Zitgist, but this is a matter for a different post.)

Thus, technically speaking, the mapping landscape is diverse, but ETL (extract-transform-load) seems to predominate. The biomedical people make data warehouses for answering specific questions. The web people are interested in putting data out in the expectation that the next player will warehouse it and allow running complex meshups against the whole of the RDF-ized web.

As one would expect, these groups see different issues and needs. Roughly speaking, one is about quality and structure and the other is about volume.

Where do we stand?

We are with the research data warehousers in saying that the mapping question is very complex and that it would indeed be nice to bypass ETL and go to the source RDBMS(s) on demand. Projects in this direction are ongoing.

We are with the web people in building large RDF stores with scalable query answering for arbitrary RDF, for example, hosting a lot of the Linking Open Data sets, and working with Zitgist.

These things are somewhat different.

At present, both the research warehousers and the web scalers predominantly go for ETL.

This is fine by us as we definitely are in the large RDF store race.

Still, mapping has its point. A relational store will perform quite a bit faster than a quad store if it has the right covering indices or application-specific compressed columnar layout. Thus, there is nothing to block us from querying analytics in SPARQL, once the obviously necessary extensions of sub-query, expressions and aggregation are in place.

To cite an example, the Ordnance Survey of the UK has a GIS system running on Oracle with an entry pretty much for each mailbox, lamp post, and hedgerow in the country. According to Ordnance Survey, this would be 1 petatriple, 1e15 triples. "Such a big server farm that we'd have to put it on our map," as Jenny Harding put it at ESWC2008. I'd add that an even bigger map entry would be the power plant needed to run the 100,000 or so PCs this would take. This is counting 10 gigatriples per PC, which would not even give very good working sets.

So, on-the-fly RDBMS-to-RDF mapping in some cases is simply necessary. Still, the benefits of RDF for integration can be preserved if the translation middleware is smart enough. Specifically, this entails knowing what tables can be joined with what other tables and pushing maximum processing to the RDBMS(s) involved in the query.

You can download the slide set I used for the Virtuoso presentation for the RDB to RDF mapping incubator group (PPT; other formats coming soon). The main point is that real integration is hard and needs smart query splitting and optimization, as well as real understanding of the databases and subject matter from the information architect. Sometimes in the web space it can suffice to put data out there with trivial RDF translation and hope that a search engine or such will figure out how to join this with something else. For the enterprise, things are not so. Benefits are clear if one can navigate between disjoint silos but making this accurate enough for deriving business conclusions, as well as efficient enough for production, is a soluble and non-trivial question.

We will show the basics of this with the TPC-H mapping, and by joining this with physical triples. We will also make a set of TPC-H format table sets, make mappings between keys in one to keys in the other, and show joins between the two. The SPARQL querying of one such data store is a done deal, including the SPARQL extensions for this. There is even a demo paper, Business Intelligence Extensions for SPARQL (PDF; other formats coming soon), by us on the subject in the ESWC 2008 proceedings. If there is an issue left, it is just the technicality of always producing SQL that looks hand-crafted and hence is better understood by the target RDBMS(s). For example, Oracle works better if one uses an IN sub-query instead of the equivalent existence test.

Follow this blog for more on the topic; published papers are always a limited view on the matter.

# PermaLink Comments [0]
06/09/2008 10:02 GMT-0500 Modified: 06/11/2008 13:15 GMT-0500
DBpedia Benchmark Revisited
DBpedia Benchmark Revisited

We ran the DBpedia benchmark queries again with different configurations of Virtuoso. I had not studied the details of the matter previously but now did have a closer look at the queries.

Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, with about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was with 4 processes in the same box.

We ran the queries in two variants:

  • With graph specified in the SPARQL FROM clause, using the default indices.
  • With no graph specified anywhere, using an alternate indexing scheme.

The times below are for the sequence of 5 queries; individual query times are not reported. I did not do a line-by-line review of the execution plans since they seem to run well enough. We could get some extra mileage from cost model tweaks, especially for the numeric range conditions, but we will do this when somebody comes up with better times.

First, about Virtuoso v5: Because there is a query in the set that specifies no condition on S or O and only P, this simply cannot be done with the default indices. With Virtuoso Cluster v6 it sort-of can, because v6 is more space efficient.

So we added the index:

create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
  Virtuoso v5 with
gspo, ogps, pogs
Virtuoso Cluster v6 with
gspo, ogps
Virtuoso Cluster v6 with
gspo, ogps, pogs
cold 210 s 136 s 33.4 s
warm 0.600 s 4.01 s 0.628 s

OK, so now let us do it without a graph being specified. For all platforms, we drop any existing indices, and --

create table r2 (g iri_id_8, s, iri_id_8, p iri_id_8, o any, primary key (s, p, o, g))
alter index R2 on R2 partition (s int (0hexffff00));

log_enable (2);
insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;

drop table rdf_quad;
alter table r2 rename RDF_QUAD;
create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));

The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.

With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia.

If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.

  Virtuoso v5 with
gspo, ogps, pogs
Virtuoso Cluster v6 with
spog, pogs, opgs, gpos
warm 0.595 s 0.617 s

The cold times were about the same as above, so not reproduced.

Graph or No Graph?

It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.

On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant different setups of the database, simple as that.

The latter use case is not really within the SPARQL spec, so implementations may or may not support this. For example Oracle or Vertica would not do this well since they partition data according to graph or predicate, respectively. On the other hand, stores that work with one quad table, which is most of the ones out there, should do it maybe with some configuring, as shown above.

Frameworks like Jena are not to my knowledge geared towards having a wildcard for graph, although I would suppose this can be arranged by adding some "super-graph" object, a graph of all graphs. I don't think this is directly supported and besides most apps would not need it.

Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.

Conclusions

The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.

Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content.

This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.

The morale of the story is to have a lot of RAM and space-efficient data representation.

The DBpedia benchmark does not specify any random access pattern that would give a measure of sustained throughput under load, so we are left with the extremes of cold and warm cache of which neither is quite realistic.

Chris Bizer and I have talked on and off about benchmarks and I have made suggestions that we will see incorporated into the Berlin SPARQL benchmark, which will, I believe, be much more informative.

Appendix: Query Text

For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM <http://dbpedia.org> from each query. The returned row counts are indicated below each query's text.

sparql SELECT ?p ?o FROM <http://dbpedia.org> WHERE {
  <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o };

-- 1337 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM <http://dbpedia.org> WHERE {
  ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
  ?film1 p:starring ?actor1 .
  ?film2 p:starring ?actor1 .
  ?film2 p:starring ?actor2 . };

--  23910 rows

sparql PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director FROM <http://dbpedia.org> 
WHERE {
  ?artwork p:artist ?artist .
  ?artwork p:museum ?museum .
  ?museum p:director ?director };

-- 303 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage FROM <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
   <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?berlinLat + 0.03190235436 &&
     ?long       >=     ?berlinLong - 0.08679199218 &&
     ?lat        >=     ?berlinLat - 0.03190235436 && 
     ?long       <=     ?berlinLong + 0.08679199218) };

-- 56 rows

sparql PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage FROM <http://dbpedia.org>  WHERE {
   <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
   <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong . 
   ?s geo:lat ?lat .
   ?s geo:long ?long .
   ?s p:architect ?a .
   ?a foaf:homepage ?homepage .
   FILTER (
     ?lat        <=     ?nyLat + 0.3190235436 &&
     ?long       >=     ?nyLong - 0.8679199218 &&
     ?lat        >=     ?nyLat - 0.3190235436 && 
     ?long       <=     ?nyLong + 0.8679199218) };

-- 13 rows
# PermaLink Comments [0]
05/09/2008 15:33 GMT-0500 Modified: 05/12/2008 11:24 GMT-0500
RDBMS to RDF Mapping Workshop, and Benchmarks
RDBMS to RDF Mapping Workshop, and Benchmarks

I was recently in Boston for the Mapping Relational Data to RDF workshop of the W3C.

The common feeling was that mapping everything to RDF and querying it in terms of a generic domain ontology, mapped on demand into whatever line of business systems, would be very good if it only could be done. However, since this is not so easily done, the next best is to extract the data and then warehouse it as RDF.

The obstacles perceived were of the following types:

  • Lack of quality in the data. The different line of business systems do not in and of themselves hold enough semantics. If the meaning of data columns in relational tables were really known and explicit, these could be meaningfully used for joining across systems. But this is more complex than just mapping the metal lead to the chemical symbol Pb and back.

  • Lack of performance in RDF storage. Data sets even in the tens-of-millions of triples do not run very well in some stores. Well, we had the Banff life sciences demo with 450M triples in a small server box running Virtuoso, so this is not universal, plus of course we are coming up with a whole different order of magnitude, as often discussed on this blog.

  • Lack of functionality in mapping and possibly lack of pushing through enough of the query processing to the underlying data stores.

Personally, I am quite aware of what to do with regard to performance of mapping and storage, and see these as eminently solvable issues. After all, we have a great investment of talent in databases in general and it can be well deployed towards RDF, as we have been doing these past couple of years. So we talk about the promise of a 360-degree view of information, with RDF being the top layer. Everybody agrees that this is a nice concept. But this is a nice concept especially when it can do the things that are the most common baseline expectation of any regular DBMS, i.e., aggregation, grouping, sub-queries, VIEWs. Now, I would not go sell a DBMS that has no COUNT operator to a data warehousing shop.

The fact that OpenLink and Oracle allow RDF inside SQL, and OpenLink even adds native aggregates and grouping to SPARQL, fixes the problem with regard to specific products, but leaves the standardization issue open. Of course, any vendor will solve these questions one way or another because a database with no aggregation is a non-starter.

I talked to Lee Feigenbaum, chair of the W3C DAWG, about the question of aggregates and general BI capabilities in SPARQL. He told me that, prior to his time with the DAWG, these were left out because they conflicted with the open-world assumption around RDF: You cannot count a set because by definition you do not know that you have all the members, the world being open and all that.

Say what? Talk about the road to hell being paved with good intentions. Now, this is in no way Lee's or the present day DAWG's fault; as a member myself, I can attest to the good work and would under no circumstances wish any delays or revisions to SPARQL at this point. I am just pointing out a matter that all implementations should address, as a sort of precondition of entry into the real world IS space. If this can be done interoperably, so much the better.

Now, out of the deliberations at the Boston workshop arose at least two ideas for follow-up activity.

The first was an incubator group for RDF store and mapping benchmarking. This is very appropriate in order to dispel the bad name RDF storage and querying performance has been saddled with. As a first step in this direction, I will outline a social web oriented benchmark on this blog.

The second activity was an incubator group for preparing standardization of mapping methodologies from relational schemas to RDF. We will be active on this as well.

The two offshoots appear logically separate but are not necessarily so in practice. A benchmark is after all something that is supposed to promote a technology to a user base. The user base seems to wish to put all online systems and data warehouses under a common top level RDF model and then query away, introducing no further replication of data or performance cost or ETL latencies.

Updating would also be nice but even query only would be very good. Personally, I'd say the RDF strength is all on the query side. Transactions are taken care of well enough by what there already is, RDF stands out in integration and the ad-hoc and discovery side of the matter. Given this, we expect the value to be consumed in a heterogeneous, multi-database, federated environment. Thus a benchmark should measure this aspect of the use-case. With the right mapping and queries, we could probably demonstrate the added cost of RDF to be very low, as long as we could push all queries that can be answered by a single source to the responsible DBMS. For distributed joins, we are back at the question of optimizing distributed queries but this is a familiar one and RDF is not the principal cost factor.

The subject does become quite complex at this point. We would have to take supposedly representative synthetic OLTP and BI data sets (like the ones in TPC-D, TPC-E, and TPC-H), and invent queries across them that would both make sense and be implementable in SPARQL extended with aggregates and sub-queries. Reliance on SPARQL extensions is simply unavoidable. Setting up the test systems would be non-trivial, even though there is a lot of industry experience in these matters on the database side.

So, while this is probably the benchmark most relevant to the target audience, we may have to start with a simpler one. I will next outline something to the effect.

# PermaLink Comments [0]
11/21/2007 08:07 GMT-0500 Modified: 04/25/2008 16:29 GMT-0500
Virtuoso and cluster capacity allocation
Virtuoso and cluster capacity allocation

I just read Google's Bigtable paper. It is relevant here because it talks about keeping petabyte scale (1024TB) tables on a variable size cluster of machines.

I have talked about partitioning versus distributed cache in the second to last post. The problem in short is that you do not expect a DBA to really know how to partition things, and even if the indices are correctly partitioned initially, repartitioning them is so bad that doing it online can be a problem. And repartitioning is needed whenever adding machines, unless the size increment is a doubling, which it will never be.

So Oracle has really elegantly stepped around the whole problem by not partitioning for clustering in the first place. So incremental capacity change does not require repartitioning. Oracle has partitioning for other purposes but this is not tied to their cluster proposition.

I did not go the cache fusion route because I could not figure a way to know with near certainty where to send a request for a given key value. In the case we are interested in, the job simply must go to the data and not the other way around. Besides, not being totally dependent on a microsecond latency interconnect and a SAN for performance enhances deployment options. Sending large batches of functions tolerates latency better than cache consistency messages which are a page at a time, unless of course you kill yourself with extra trickery for batching these too.

So how to adapt to capacity change? Well, by making the unit of capacity allocation much smaller than a machine, of course.

Google has done this in Bigtable by a scheme of dynamic range partitioning. The partition size is in the tens to hundreds of megabytes, something that can be moved around within reason. When the partition, called a tablet, gets too big, it splits. Just like a Btree index. The tree top must be common knowledge, as well as the allocation of partitions to servers but these can be cached here and there and do not change all the time.

So how could we do something of the sort here? I know for an experiential fact that when people cannot change the server memory pool size, let alone correctly set up disk striping, they simply cannot be expected to deal with partitioning. Besides, even if you know exactly what you are doing and why, configuring and refilling large numbers of partitions by hand is error prone, tedious, time consuming, and will run out of disk and require restoring backups and all sorts of DBA activity that will have everything down for a long time, unless of course you have MIS staff such as is not easily found.

The solution is not so complex. We start with a set number of machines and make a file group on each. A file group has a bunch of disk stripes and a log file and can be laid out on the local file system in the usual manner. The data goes into the file group, partitioned as defined. You still specify partitioning columns but not where each partition goes. The system will decide this by itself. When a server's file group gets too big, it splits. One half of each key's partition in the original stays where it was and the other half goes to the copy. The copies will hold rows that no longer belong there but these can be removed in the background. The new file group will be managed by the same server process and the partitioning information on all servers gets updated to reflect the existence of the new file group and the range of hash values that belong there.

If a file group is kept at some reasonable size, under a few GB, these can be moved around between servers, even dynamically.

If data is kept replicated, then the replicas have to split at the same time and the system will have to make sure that the replicas are kept on separate machines.

So what happens to disk locality when file groups split? Nothing much. Firstly, partitioning will be set up so that consecutive values go to the same hash value, so that key compression is not ruined. Thus, consecutive numbers will be on the same page. Imagine an integer key partitioned two ways on bits 10-20. Values 0-1K go together, values 1K-2K go another way, values 2K-3K go the first way etc.

Now let us suppose the first partition, the even K's splits. It could split so that multiples of 4 go one way and the rest another way. Now we'd have 0-1K in place, 2K-3K in the new partition, 4K-5K in place and so on. A sequential disk read, with some read ahead, would scan the partitions in parallel but the disk access would be made sequential by the read ahead logic — remember that these are controlled by the same server process.

For purposes of sending functions, the file group would be the recipient, not the host, per se. The allocation of file groups to hosts could change.

Now picture a transaction that touches multiple file groups. The requests going to collocated file groups can travel in the same batch and the recipient server process can run them sequentially or with a thread per file group, as may be convenient. Multiple threads per query on the same index make contention and needless thread switches. But since distinct file groups have their distinct mutexes there is less interference.

For purposes of transactions, we might view a file group as deserving a its own branch. In this way we would not have to abort transactions if file groups moved. A file group split would probably have to kill all uncommitted transactions on it so as not to have to split one branch in two or deal with uncommitted data in the split. This is hardly a problem, the event being rare. For purposes of checkpoints, logging, log archival, recovery, and such, a file group is its own unit. The Bigtable paper had some ideas about combining transaction logs and such, all quite straightforward and intuitive.

Writing the clustering logic with the file group, not the database process, as the main unit of location is a good idea and an entirely trivial change. This will make it possible to adjust capacity in almost real time without bringing everything to a halt by re-inserting terabytes of data in system wide repartitioning runs.

Implementing this on the current Virtuoso is not a real difficulty. There is already a concept of file group, although we use only two, one for the data and one for temp. Using multiple ones is not a big deal.

Supporting capacity allocation at the file group level instead of the server level can be introduced towards the middle of the clustering effort and will not greatly impact timetables.

# PermaLink Comments [0]
08/28/2007 07:54 GMT-0500 Modified: 04/30/2008 15:08 GMT-0500
Virtuoso Cluster Preview
Virtuoso Cluster Preview

I wrote the basics of the Virtuoso clustering support over the past three weeks.  It can now manage connections, decide where things go, do two phase commits, insert and select data from tables partitioned over multiple Virtuoso instances.  It works about enough to be measured, of which I will blog more over the next two weeks.

I will in the following give a features preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007).

Data Partitioning

A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key.  Blobs are stored on separate pages when they do not fit inline within the row.  With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored.  Virtuoso partitions by hash and allows specifying what parts of partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string.  Like this, key compression gains are not lost by storing consecutive values on different partitions.

Once the partitioning is specified, we specify which set of cluster nodes stores this index.  Not every index has to be split evenly across all nodes.  Also, all nodes do not have to have equal slices of the partitioned index, accommodating differences in capacity between cluster nodes.

Each Virtuoso instance can manage up to 32TB of data.  A cluster has no definite size limit.

Load Balancing and Fault Tolerance

When data is partitioned, an operation on the data goes where the data is.  This provides a certain natural parallelism but we will discuss this further below.

Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load.  Some data, such as database schema, is replicated on all nodes.  When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition.  If this is the case, updates go to all nodes and reads go to a randomly picked node from the group.

If one of the nodes in the group fails, operation can resume with the surviving node.  The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster but these are aborts as in the case of deadlock and lose no committed data.

Shared Nothing

The Virtuoso architecture does not require a SAN for disk sharing across nodes.  This is reasonable since a few disks on a local controller can easily provide 300MB/s of read and passing this over an interconnect fabric that would also have to carry inter-node messages could saturate even a fast network.

Client View

A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics.  DDL operations like table creation and package installation are limited to one node, though.

Applications such as ODS will run unmodified.  They are installed on all nodes with a single install command.  After this, the data partitioning must be declared, which is a one time operation to be done cluster by cluster.  The only application change is specifying the partitioning columns for each index.  The gain is optional redundant storage and capacity not limited to a single machine.  The penalty is that single operations may take a little longer when not all data is managed by the same process but then the parallel throughput is increased.  We note that the main ODS performance factor is web page logic and not database access.  Thus splitting the web server logic over multiple nodes gives basically linear scaling.

Parallel Query Execution

Message latency is the principal performance factor in a clustered database.  Due to this, Virtuoso packs the maximum number of operations in a single message.  For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the join of the inner loop are run in parallel.  So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency.  If each step of the join gets 10 rows, for a total of 100000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.

Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node.

Parallel SQL

There are SQL extensions for stored procedures allowing parallelizing operations.  For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned.  Transactional semantics are kept but error detection is deferred to the actual execution.

Transactions

Each transaction is owned by one node of the cluster, the node to which the client is connected.  When more than one node besides the owner of the transaction is updated, two phase commit is used.  This is transparent to the application code.  No external transaction monitor is required, the Virtuoso instances perform these functions internally.  There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting information.

Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common.

Interconnect and Threading

Virtuoso uses TCP to connect between instances.  A single instance can have multiple listeners at different network interfaces for cluster activity.  The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface.  Long messages, such as transfers of blobs are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding.

We will have to test the performance of TCP over Infiniband to see if there is clear gain in going to a lower level interface like MPI.  The Virtuoso architecture is based on streams connecting cluster nodes point to point.  The design does not per se gain from remote DMA or other features provided by MPI.  Typically, messages are quite short, under 100K.  Flow control for transfer of blobs is however nice to have but can be written at the application level if needed.  We will get real data on the performance of different interconnects in the next weeks.

Deployment and Management

Configuring is quite simple, with each process sharing a copy of the same configuration file.  One line in the file differs from host to host, telling it which one it is.  Otherwise the database configuration files are individual per host, accommodating different file system layouts etc.  Setting up a node requires copying the executable and two configuration files, no more.   All functionality is contained in a single process.  There are no installers to be run or such.

Changing the number or network interface of cluster nodes requires a cluster restart.  Changing data partitioning requires copying the data into a new table and renaming this over the old one.  This is time consuming and does not mix well with updates.  Splitting an existing cluster node requires no copying with repartitioning but shifting data between partitions does.

A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes.

Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.

Present State and Next Developments

The basics are now in place.  Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc.  Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written.

The RDF capabilities are not specifically affected by clustering except in a couple of places.  Loading will be slightly revised to use larger batches of rows to minimize latency, for example.

There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins etc.  These will be added gradually.  These are however not really central to the first application of RDF storage but are quite important for business intelligence, for example.

We will run some benchmarks for comparing single host and clustered Virtuoso instances over the next weeks.  Some of this will be with real data, giving an estimate on when we can move some of the RDF data we presently host to the new platform.  We will benchmark against Oracle and DB2 later but first we get things to work and compare against ourselves.

We roughly expect a halving in space consumption and a significant increase in single query performance and linearly scaling parallel throughput through addition of cluster nodes.

The next update will be on this blog within two weeks.

# PermaLink Comments [0]
08/27/2007 05:51 GMT-0500 Modified: 04/25/2008 11:59 GMT-0500
RDF, Clustered Databases, and Partitioning Vs. Cache Fusion
RDF, Clustered Databases, and Partitioning Vs. Cache Fusion

I recently read Oracle's papers about RAC, Real Application Clusters. This is relevant as we are presently working on the Virtuoso equivalent.

Caveat: The following is quite technical and not the final word on the matter.

Oracle's claim is roughly as follows: Take a number of machines with access to a shared pool of disks and get scalability in processing power and memory without having to explicitly partition the data or perform other complicated configuration.

This works through implementing a cache consistency protocol between the participating boxes and by parallelizing queries just as one would do on a shared memory SMP box. Each disk page has a box assigned to keep track of it and the responsibility migrates so that the box most often needing the page gets to be the page's guardian, so as not to have to ask anybody else for permission to write the page.

This is a compelling proposition. Surely, it must be unrealistic to expect people to manually partition databases. This would require some understanding of first principles which is scarce out there.

So, should we implement clustering a la Oracle?

Let's look at some basics. If we have an OLTP workload like TPC-C, we usually have affinity between clients and the data they access. This will make each client's pages migrate to be managed by the box the client is connected to. This will work pretty well, no worse than with a single box. If two clients are updating the same data but are connected to two different boxes, this is quite bad since the box that does not have responsibility for the page must ask the other box for write access. This is a round trip, at least tens of microseconds (µs). Consider in comparison that finding a row out of a million takes some 3µs.

Would it not be better to have each partition in a known place and leave all processing to that place? The write contention would be resolved in the box owning the partition and there would be a message but now for requesting the update, not dealing with cache consistency. At what level should one communicate between cluster nodes? Talk about disk pages or about logical operations? If there is complete affinity between boxes and data, the RAC style shared cache needs no messages, each box ends up managing the pages of its clients and all works just as with a local situation. If on the other hand any client will update any page at random, most updates must request the write permission from another node. I will here presume that index tree tops get eventually cached on all nodes. If this were not so, even index lookups would have to most often request each index page from a remote box. Never forget that with a tree index, it takes about 1µs to descend one level and 50µs for a message round trip between processes, not counting any transport latency.

What of RDF query workloads? After all, we are in the first instance concerned with winning the RDF storage race and after this the TPC ones. We design for both but do RDF first since this is our chosen specialty.

The disadvantage of having to specify partitioning is less weighty with RDF since there are only a few big tables and they will be at default settings, pretty much always. We do not expect the application developer to ever change these settings although it is in principle possible.

What about queries? The RDF workload is mostly random access and loop joins. How would these run on RAC? For now, let's make a thought experiment and compare cache fusion to a hash partitioned cluster. In the following, I do not describe how Oracle actually works but will just describe how I would do things if I implemented a RAC style clustering. With a RAC style cluster, I'd split the outer loop into equal partitions and run them in parallel on different boxes. Each would build a working set for its part of the query and pages that were needed by more than one box would be read once from disk and a second time from the node that had them in cache. The top nodes of index trees would end up cached on all boxes. It would seem that all boxes would fill their cache with the same data. Now it may be that RAC makes it so that a page is cached only on one box and other boxes wanting the page must go to that box to get access to the page. But this would be a disaster in index lookups. It is less than a microsecond per local index tree level, but if there is a round trip, it would be at least 50µs per level of the index tree. I don't know for sure about Oracle but if I did RAC, I'd have to allow duplicate read content in caches. This would have the effect that the aggregate cache size would be closer to the single cache size than to the sum of the cache sizes. A physically partitioned database would not ship pages so caches would not overlap and the aggregate cache would indeed be the sum of the sizes. Now this is good only insofar all boxes participate but with evenly distributed data this is a good possibility.

Of course, if RAC knew to split queries so that data and nodes had real affinity then the problem would be less. For indices one would need a map of key values to boxes. A little like a top level index shared among all nodes. The key value would give the node, just like with partitioning.

This would make partitioning on the fly. Joins that were made the most frequently would cause migration that would make these joins co-located.

We must optimize the number of messages needed to execute long series of loop joins. For parallelizing single queries, the most obvious approach would be to partition the first/outermost loop that is more than one iteration into equal size chunks. With RDF data, the join keys will mostly begin GS or GO with a possible P appended. If GO or GS specify the partition, partitioning by hash will yield the node that will provide the result.

The number of messages can be reduced to a minimum of the number of join steps times number of boxes minus one if the loops are short enough and multiple operations are carried by one message.

With RAC style clustering, each index lookup would have to be sent to the node most likely to hold the answer. If pages have to be fetched from other nodes, we have disastrous performance, at least 50µs for each non-local page. If there are two non-local pages in a lookup, the overhead will exceed the overhead of delegating the single lookup. Index page access in lookups cannot be easily batched, the way index lookups going to the same node can be batched. Batching multiple hopefully long operations into a single message is the only way to defeat the extreme cost of sending messages. An index lookup does not know what page it will need until it needs it. A way of batching these would be to run multiple lookups in parallel and to combine remote page requests grouping them by destination. This would not be impossible, simply we would have to run 100 index lookups in parallel on a thread, 100 first levels, 100 second levels and so forth. Suppose an outer loop that gives 100 rows and then an inner loop that retrieves 1 row for each. A query to get the email address of Mary would do this, supposing 100 Marys in the db. {?person firstName "Mary" . ?person email ?m .}. Suppose a cluster of 10 nodes. The first node gets the 100 rows of the outer loop, splits these into 10x10, 10 on each node and then each node does 10 lookups in parallel, meaning 10 first levels, 10 second levels, 10 third levels. The index tree would be 4 deep, branching 300 ways. Running the query a second time would find all data in memory and run with only 18 messages after getting the 100 rows of the first loop. The first run would send lots of messages, almost two per page, for about 800 messages after getting the 100 rows of the first loop.

With partitioning, the situation would be sending 18 messages constant. 9 batches of 10 index lookups and their replies. The latency is 50µs and the lookup is 4µs. We would in fact gain in real time, counting 50µs for messages and 4µs per lookup, the time through the whole exercise of 10x10 random lookups would be 90µs.

If I did RAC style clustering, I'd have to allow replicating the tops of index trees to all caches, and I'd have to batch page request messages from index lookups, effectively doing the lookup vector processing style, meaning 100 first levels, 100 second levels etc. Given a key beginning, I'd have to know what node to send this to, meaning pretty much doing the first levels of the lookup before deciding where to send the lookup, only to have the lookup redone by the box ending up with the lookup. Doing things this way would make Oracle RAC style clustering work with the use case.

Given this, it appears that hash partitioning is easier to implement. Cache fusion clustering without the above mentioned gimmicks would be easiest of all but it would have a disastrous number of messages or it would fill all the caches with the same data. Avoiding this is possible but hard, as described above.

We will have to experiment with Oracle RAC itself a bit farther down the road. Deciding to use partitioning instead of cache fusion does bring along conversion cost and a very high cost for repartitioning.

Now let us look at the issue of co-location of joins. In a loop join this means that the node that holds the row from the outer loop also holds the row in the inner loop. For example, if order and order line are partitioned on order id, joining them on order id will be a  co-located join. Such joins do not involve necessary messages in partitioned clusters. In RAC, they do not involve messages if the pages have migrated to be managed by the node doing the join, otherwise they do, up to 20 or so for the worst case.

Do we get any benefit from co-location with RDF? Supposing joins that go from S to O to S (e.g., population of the city where Mary lives), we do not get much guarantee of co-location.

Suppose the indices GSPO partitioned of GS and OGPS on OG, we know the box with Marys and then we'd know the box where the residence of each Mary was, based on GS. given the city as S, we would again know the box that had the population. All the three triples could be on different boxes. This cannot be helped at design time. At run time this can be helped by batching messages that go to the same node. Let's see how this fans out. 100 Marys from the first node. To get the city, we get 10 batches of 10 messages. We get the 100 cities and then we get their populations, again 10 batches of 10. In this scenario, the scheduling is centrally done by one thread. Suppose it were done by the 10 batches of 10 for getting the city of each Mary. For 10 cities, we'd get 10 lookups for population, each potentially to a different node. For this case, managing the execution by one thread instead of several makes bigger batches and less messages, as one would expect.

It seems that with the RDF case, one may as well forget co-location. In the relational case, one must take advantage of it when there is co-location and when not, try to compensate with longer batches of function shipping.

Excellent as some of the RAC claims are, it still seems that making it work well for an RDF workload would take such magical heuristics of location choice that implementing them would be hard and the result not altogether certain. I could get it to work eventually but hash partitioning seems by far the more predictable route. Also hash partitioning will work in shared nothing scenarios whereas RAC requires shared disks. Shared nothing will not require a SAN, which may make it somewhat lower cost. Also, if messages are grouped in large batches, the performance of the interconnect is not so critical, meaning that maybe even gigabit ethernet might do in cases. RAC style cache maintenance is more sensitive to interconnect latency than batched function shipping. Batched cache consistency is conceivable as discussed above but tough to do.

For recovery and hot software updates, things can be arranged if there is non-local disk access or if partitions are mirrored. A RAC type cluster could use a SAN with internal mirroring. A hash partitioned system could mirror partitions to more than one box with local disk, thus using no mirrored disks. Repartitioning remains the bane of partitioning and not much can be done about that, it seems. The only easy repartitioning is doubling the cluster size. So it seems.

# PermaLink Comments [0]
07/19/2007 08:29 GMT-0500 Modified: 04/25/2008 11:59 GMT-0500
 <<     | 1 | 2 |     >>
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform
OpenLink Software 1998-2006