Virtuoso Directions for 2011

At the start of 2010, I wrote that 2010 would be the year when RDF became performance- and cost-competitive with relational technology for data warehousing and analytics. More specifically, RDF would shine where data was heterogeneous and/or where there was a high frequency of schema change.

I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.

At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogeneous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.
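To make the vectoring idea concrete, here is a minimal, illustrative sketch, not Virtuoso internals: a join operator receives a whole batch of probe keys, sorts them for locality, and resolves them against a sorted index in one pass instead of one call per row.

```python
# Illustrative sketch only -- not Virtuoso's implementation. It shows the idea
# of vectored execution: a join operator receives a whole batch of bindings,
# sorts the keys for locality, and probes the index once per batch rather
# than once per row.
import bisect

def vectored_index_lookup(sorted_index_keys, index_values, probe_keys):
    """Probe a sorted (key, value) index with a whole vector of keys."""
    # Sorting the probe keys makes consecutive lookups touch nearby
    # index positions, which is where the cache locality benefit comes from.
    order = sorted(range(len(probe_keys)), key=lambda i: probe_keys[i])
    results = [None] * len(probe_keys)
    for i in order:
        k = probe_keys[i]
        pos = bisect.bisect_left(sorted_index_keys, k)
        if pos < len(sorted_index_keys) and sorted_index_keys[pos] == k:
            results[i] = index_values[pos]
    return results

# A loop join over a batch of 4 bindings: one call per batch, not per row.
keys = [10, 20, 30, 40, 50]
vals = ["a", "b", "c", "d", "e"]
print(vectored_index_lookup(keys, vals, [40, 10, 50, 10]))  # ['d', 'a', 'e', 'a']
```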

So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso's relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented HASH JOIN and GROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.

At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.

A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time typing turns out to be small, since data in practice tends to be of homogeneous type and with locality of reference in values. Virtuoso's column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogeneous types.
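As an illustration of why run-time typing can stay cheap, here is a small sketch, again not Virtuoso's actual layout: each column batch carries one type tag, so the type check is done once per batch of values rather than once per value, and only genuinely mixed batches pay a per-value cost.

```python
# A minimal sketch (assumed layout, not Virtuoso's) of cheap run-time typing
# in a column store: a batch of column values carries a single type tag, so
# homogeneous batches take a fast path with one type check for the whole batch.
class ColumnBatch:
    def __init__(self, values):
        types = {type(v) for v in values}
        self.homogeneous = len(types) == 1
        self.type_tag = types.pop() if self.homogeneous else None
        self.values = values

    def add_constant(self, delta):
        # Fast path: one type check per batch.
        if self.homogeneous and self.type_tag is int:
            return ColumnBatch([v + delta for v in self.values])
        # Slow path: per-value dispatch, only for heterogeneous batches.
        return ColumnBatch([v + delta if isinstance(v, int) else v
                            for v in self.values])

print(ColumnBatch([1, 2, 3]).add_constant(10).values)    # [11, 12, 13]
print(ColumnBatch([1, "x", 3]).add_constant(10).values)  # [11, 'x', 13]
```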

The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side.

The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.

LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the "RDF tax," by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.

So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.

For now, our priority is to release the substantial gains that have already been accomplished.

After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper.
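As a hedged illustration of what "trivially equivalent" might look like, here is a simplified TPC-H Q1-style aggregation in SQL next to one possible direct SPARQL rendering. The tpch: vocabulary is invented for the example; it is not a published mapping or the one we will use.

```python
# Simplified TPC-H Q1-style query and one possible direct SPARQL equivalent.
# The tpch: prefix and property names below are assumptions for illustration.
SQL_Q1 = """
SELECT l_returnflag, SUM(l_quantity) AS sum_qty
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag
"""

SPARQL_Q1 = """
PREFIX tpch: <http://example.org/tpch#>
SELECT ?returnflag (SUM(?quantity) AS ?sum_qty)
WHERE {
  ?li a tpch:Lineitem ;
      tpch:returnflag ?returnflag ;
      tpch:quantity   ?quantity ;
      tpch:shipdate   ?shipdate .
  FILTER (?shipdate <= "1998-09-02"^^<http://www.w3.org/2001/XMLSchema#date>)
}
GROUP BY ?returnflag
"""

print(SQL_Q1)
print(SPARQL_Q1)
```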

Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.

The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.

Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Running TPC-H translated to SPARQL without dying of overheads is a necessary preparatory step, but there is little point in doing this in production, since SQL is likely better for that job anyway and is already known, proven, and deployed.

The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?

Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of one's own and competitors' brands, and in-house emails for assessing organizational response to events on the market.

These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason this is RDF, and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start.

Stay tuned for more news later this month!


01/19/2011 11:29 GMT-0500 Modified: 01/20/2011 12:54 GMT-0500
Comparing Virtuoso Performance on Different Processors

Over the years we have run Virtuoso on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.

Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.

The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.
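For readers who want the shape of the test, here is a minimal sketch of such a driver, not the actual harness: load the warehouses on one thread for the single-thread number, then run one client thread per warehouse and take wall-clock time from the first client's start to the last client's finish. The loader and client bodies are placeholders.

```python
# A minimal sketch of the shape of this test, not the actual harness.
import threading, time

def load_warehouse(w):          # stand-in for loading one warehouse of data
    time.sleep(0.01)

def run_client(w, new_orders):  # stand-in for one client's new-order loop
    for _ in range(new_orders):
        pass

def run_test(warehouses=20, new_orders=10000):
    t0 = time.time()
    for w in range(warehouses):
        load_warehouse(w)
    load_seconds = time.time() - t0          # single-thread figure

    t1 = time.time()
    clients = [threading.Thread(target=run_client, args=(w, new_orders))
               for w in range(warehouses)]
    for c in clients: c.start()
    for c in clients: c.join()
    run_seconds = time.time() - t1           # multi-threaded figure
    return load_seconds, run_seconds

print(run_test())
```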

All times reported are real times, starting from the start of the first client and ending with the completion of the last client.

Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.

Test | Platform | Load (seconds) | Run (seconds) | GHz / cores / threads
1 | Amazon EC2 Extra Large (4 virtual cores) | 340 | 42 | 1.2 GHz? / 4 / 1
1 | Amazon EC2 Extra Large (4 virtual cores) | 305 | 43.3 | 1.2 GHz? / 4 / 1
2 | 1 x dual-core AMD 5900 | 263 | 58.2 | 2.9 GHz / 2 / 1
3 | 2 x dual-core Xeon 5130 ("Woodcrest") | 245 | 35.7 | 2.0 GHz / 4 / 1
4 | 2 x quad-core Xeon 5410 ("Harpertown") | 237 | 18.0 | 2.33 GHz / 8 / 1
5 | 2 x quad-core Xeon 5520 ("Nehalem") | 162 | 18.3 | 2.26 GHz / 8 / 2

We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8 GB of RAM. The 3 Xeon configurations are Supermicro boards with 667 MHz memory for the Xeon 5130 ("Woodcrest") and Xeon 5410 ("Harpertown"), and 800 MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 ("Nehalem"), 72 GB RAM, and 8 x 500 GB SATA disks.

Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We'll see another time with faster memories.

The operating systems were various 64 bit Linux distributions.

We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.

We tried some RDF operations on the two last systems:

Operation | Harpertown | Nehalem
Build text index for DBpedia | 1080 s | 770 s
Entity Rank iteration | 263 s | 251 s

Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.

For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.

Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.

  1. On the host OS of the Nehalem system:
    Queries | CPU % | Rows per second
    1 | 503 | 906,413
    2 | 1263 | 1,578,585
    3 | 1204 | 1,566,849
  2. In a VM under Xen, on the Nehalem system:
    Queries | CPU % | Rows per second
    1 | 652 | 799,293
    2 | 1266 | 1,486,710
    3 | 1222 | 1,484,093
  3. On the host OS of the Harpertown system:
    Queries | CPU % | Rows per second
    1 | 648 | 1,041,448
    2 | 708 | 1,124,866

The CPU percentages are as reported by the OS: user + system CPU divided by real time.

So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.

We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, also tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.

The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load.

The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got had 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, make sure it comes with 1333 MHz memory; otherwise the best case will not be more than 50% better than a 667 MHz Core 2-based Xeon.

Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.

If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.

05/28/2009 10:54 GMT-0500 Modified: 05/28/2009 11:15 GMT-0500
De Paradigmata and The Foundational Issues

I thought that we had talked ourselves to exhaustion and beyond over the issue of the semantic web layer cake. Apparently not. There was a paper called Functional Architecture for the Semantic Web by Aurona Gerber et al at ESWC2008.

The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.

Any diagram containing the boxes of the layer cake (i.e., URI, XML, SPARQL, OWL, RIF, Crypto, etc., etc.) in whatever order or arrangement can at best be a sort of overview of how these standards reference each other.

Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixtures with the tension and compression resistance of metals and composites to produce motion, and then pointing to Newton's laws of motion and to aerodynamics.

Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve.

Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.

Such a thing cannot possibly be inferred from the standards' inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low data volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.

We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.

The even-more foundational issue is the promotion of the general concept of a Web of Data.

The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. Or, better: the first picture shown should stress this idea, and the layer cake, in whatever version, could then be left for explaining the standards' order of evolution or inter-reference.

So, the value proposition:

Why? Explosion of data volume, increased need of keeping up-to-date, increasing opportunity cost of not keeping in real time.

What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.

How? URI everything and everything is cool; in other words, give things global names. Use RDF. Reuse names and ontologies where you can. (An ontology is a set of classes and property names, plus some more.) Map relational data on the fly or store it as RDF, whichever works. Query with SPARQL, which is easier than SQL.
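As a tiny, made-up illustration of that recipe: give things global names, state facts as triples, and join across sources with SPARQL. The URIs and vocabulary below are invented for the example.

```python
# Invented example data and query; the ex: vocabulary is an assumption.
TRIPLES = """
@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# from source A (an HR database mapped to RDF)
ex:alice  foaf:name   "Alice" ;
          ex:worksFor ex:acme .

# from source B (a company registry published as linked data)
ex:acme   ex:locatedIn ex:burlington .
"""

QUERY = """
PREFIX ex:   <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?city WHERE {
  ?person foaf:name   ?name ;
          ex:worksFor ?org .
  ?org    ex:locatedIn ?city .
}
"""

print(TRIPLES)
print(QUERY)
```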

So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.

06/09/2008 10:02 GMT-0500 Modified: 06/11/2008 15:54 GMT-0500
WWW 2008

Following my return from WWW 2008 in Beijing, I will write a series of blog posts discussing diverse topics that were brought up in presentations and conversations during the week.

Linked data was our main interest in the conference and there was a one day workshop on this, unfortunately overlapping with a day of W3C Advisory Committee meetings. Hence Tim Berners-Lee, one of the chairs of the workshop, could not attend for most of the day. Still, he was present to say that "Linked open data is the semantic web and the web done as it ought to be done."

For my part, I will draw some architecture conclusions from the different talks and extrapolate about the requirements on database platforms for linked data.

Chris Bizer predicted that 2008 would be the year of data web search, if 2007 was the year of SPARQL. This may be the case, as linked data is now pretty much a reality and the questions of discovery become prevalent. There was a birds-of-a-feather session on this and I will make some comments on what we intend to explore in bridging between the text index based semantic web search engines and SPARQL.

Andy Seaborne convened a birds-of-a-feather session on the future of SPARQL. Many of the already anticipated and implemented requirements were confirmed and a few were introduced. A separate blog post will discuss these further.

From the various discussions held throughout the conference, we conclude that plug-and-play operation with the major semantic web frameworks of Jena, Sesame, and Redland is our major immediate-term deliverable. Our efforts in this direction thus far are insufficient, and we will now have this done with the right supervision and proper interop testing. The issues are fortunately simple, but doing things totally right requires some small server-side support and some JDBC/ODBC tweaks, so we advise interested parties to wait for an update to be published on this blog.

I further had a conversation with Andy Seaborne about using Jena reasoning capabilities with Virtuoso and generally the issues of "impedance mismatch" between reasoning and typical database workloads. More on this later.

04/29/2008 10:37 GMT-0500 Modified: 04/29/2008 13:35 GMT-0500
Architectures for Infinity

A while back, a friend suggested he and I go check out the Singularity Summit, a conference where they talk of strong AI. Well, I am not a singularist. But since singularists are so brazenly provocative, I looked a little to see if there is anything there, engineering-wise.

So, for a change, we'll be visionary. Read on, I also mention RDF at the end.

I will not even begin with arguments about indefinitely continuing a trend. I will just say that nature has continuous and discontinuous features. Computing is at a qualitative bend and things are not as simple as they are blithely cracked up to be. Read HEC, the high end crusader; he has good commentary on architecture. Having absorbed and understood this, you can talk again about a billion-fold increase in computing price/performance.

When I looked further into uploading, i.e., the sci-fi process of scanning a brain and having it continue its function inside a computer simulation, I actually found some serious work. Dharmendra Modha at IBM had made a rat-scale cortical simulator running on a 32K node IBM BlueGene, with a 9x slowdown from real time. This is real stuff and in no way meant to be associated with singularism by being mentioned in the same paragraph. Anyway, I was intrigued.

So I asked myself if I could do better. I gave it a fair try, as fair as can be without an actual experiment. The end result is that latency rules. To have deterministic semantics, one must sync across tens of thousands of processors and one cannot entirely eliminate multi-hop messages on the interconnect fabric. If one optimistically precalculates and spreads optimistic results over the cluster, rolling them back will be expensive. Local optimistic computation may have little point since most data points will directly depend on non-local data. One needs two sets of wiring, a 3D torus for predominantly close range bulk and a tree for sync, just like the BlueGene has. Making the same from newer hardware makes it more economical but there are another 8 or so orders of magnitude to go before energy efficiency parity with biology. Anyway, a many-times-over qualitative gap. Human scale in real time might be just within reach with the stuff now in pre-release if there were no bounds on cost or power consumption. We'd be talking billions at least and even then it is iffy. But this is of little relevance as long as the rat scale is at the level of a systems test for the platform and not actually simulating a plausible organism in some sort of virtual environment. Anyway, of biology I cannot judge, and as for the computer science, Modha et al have it figured out about as well as can be.

Simulation workloads are not the same as database. Database is easier in a way since any global sync is very rare. 2PC seldom touches every partition and if it does, the time to update was anyway greater than the time to commit. Databases are generally multi-user, with a low level of sync between users and a lot of asynchrony. On the other hand, the general case of database does not have predictable cluster locality. Well, if one has an OLTP app with a controlled set of transactions and reports, then one can partition just for that and have almost 100% affinity between the host serving the connection and the data. With RDF for example, such is not generally possible or would require such acrobatics of data layout that a DBA would not even begin.

So, for database, there is a pretty much even probability for any connection between the node running the query and any other. True, function shipping can make the messages large and fairly async and latency tolerant.

On the simulation side, it would seem that the wiring can mimic locality of the simulated process. Messages to neighbors are more likely than messages to remote nodes. So a 3D torus works well there, complemented by a tree for control messages, like sending all nodes the count of messages to expect. Of course, the control wiring (reduce tree) must have far less latency than a long path through the torus wiring and steps with long message deliveries on the torus must be rare for this to bring any profit. Also, long distance messages can go through the tree wiring if the volume is not excessive, else the tree top gets congested.

So, looking past the multicore/multithread and a single level switched interconnect, what would the architecture of the total knowledge machine be? For the neural simulation, the above-described is the best I can come up with and IBM already has come up with it anyway. But here I am more concerned with a database/symbol processing workload than about physical simulation. For the record, I'll say that I expect no strong AI to emerge from these pursuits but that they will still be useful, basically as a support for a linked data "planetary datasphere." Like Google, except it supports arbitrary queries joining arbitrary data at web scale. This would be a couple of orders of magnitude above the text index of same. Now, in practice, most queries will be rather trivial, so it is not that the 2 orders of magnitude are always realized. This would also involve a bit more updating than the equivalent text index since reports would now and then include private materialized inference results. As a general rule, backward chaining would be nicer, since this is read only but some private workspaces for distilling data cannot be avoided.

So, given this general spec, where should architecture go? We can talk of processors, networks and software in turn.

Processors

For evaluating processors, the archetypal task is doing a single random lookup out of a big index. Whether it is a B tree or hash is not really essential since both will have to be built of fixed size pages and will have more than one level. Both need some serialization for checking if things are in memory and for pinning them in cache for the duration of processing. This is eternally true as long as updates are permitted, even if RAM were the bottom of the hierarchy. And if not, then there is the checking of whether a page is in memory and the logic for pinning it.

This means critical sections on shared data structures with small memory writes inside. This is anathema, yet true. So, what the processor needs is shared memory for threads and if possible an instruction for entry into a read-write lock. If the lock is busy, there should be a settable interval before the thread is removed from the core. With a multithread core, this should be just like a memory cache miss. Only if this were really prolonged would there be an interrupt to run the OS scheduler to vacate the thread and load something else, and not even this if the number of executable threads were less or equal the number of threads on the cores.
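To illustrate the policy being asked for, here is a user-space sketch: spin on the lock for a short, settable interval before letting the scheduler take the thread off the core. The text argues this belongs in hardware and the OS; the sketch only shows the shape of the idea.

```python
# A user-space sketch of "spin briefly before descheduling the thread".
# This illustrates the policy only; it is not the hardware/OS mechanism
# the text argues for.
import threading, time

def acquire_spin_then_block(lock, spin_interval=1e-5):
    deadline = time.perf_counter() + spin_interval
    while time.perf_counter() < deadline:
        if lock.acquire(blocking=False):   # busy-wait: cheap if the holder is quick
            return
    lock.acquire()                         # give up and block (context switch)

lock = threading.Lock()
acquire_spin_then_block(lock)
lock.release()
```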

For thread sync structures, the transactional memory ideas on the SPARC Rock might be going in this direction. For on-core threading, I have not tried how well SPARC T2 does but with the right OS support, it might be in the ballpark. The x86_64 chips have nice thread speed but the OS, whether Solaris or Linux, is a disaster if a mutex is busy. Don't know why but so it is.

Things like IBM Cell, with multiple integrated distributed memory processors on a chip might be workable if they had hardware for global critical sections and if sub-microsecond tasks could be profitably dispatched to the specialized cores (synergistic unit in Cell terminology). If an index lookup, about 2-4 microseconds of real time, could be profitably carried out on a synergistic core without sinking it all in sync delays on the main core, there might be some gain to this. Still, operations like cache lookups tend to be pointer chasing and latency of small memory reads and sometimes writes is a big factor. I have not measured Cell for this but it is not advertised for this sort of workload. It might do nicely for the neuron simulator but not for generic database, I would guess.

Networks and Data Center

The point at which distributed memory wins over shared determines the size of the single compute node. Problems of threads and critical sections fall off but are replaced by network and the troubles of scheduling and serializing and de-serializing messages. There are really huge shared memory systems (by Cray, for example) but even there, hitting the wrong address sends a high latency message over an internal switched fabric, a bit like a network page fault. Well, if one has millions of threads runnable for catching the slack of memory access over interconnect, this might not be so bad, but this is not really the architecture for a general purpose database. So, at some point, we have clearly demarcated distributed memory with affinity of data to processor, simple as that.

For now, the high end clustered database benchmarks run off 1 Gbit ethernet fabrics with some 30 compute nodes on a single switch. This is for shared nothing systems. For shared disk, cache fusion systems like Oracle RAC, we have more heavy duty networking like Infiniband, as one would expect. I have discussed the merits of cache fusion vs. shared nothing partitioning in a previous post.

As long as we are at a scale that fits on a single switch with even port to port latency, we are set. For an RDF workload, throughput is not really the issue but latency can be. With today's technology, nodes with 4-8 cores and 16G RAM are practical and their number is most often not really large. Adding two orders of magnitude, we get more questions. Let's say that 2 billion triples fit with relative comfort but not without disk on a 16G RAM node. This would make 500 nodes for a trillion triples.

This is an entirely relevant and reasonable scale, considering that loading all public biomedical sets plus a pharma company's in house data could approach this ballpark. Let alone anything on the scale of the social web, i.e., the online conversation space.

So how to manage clusters like this? The current cluster gear is oriented toward switch trees and is fairly expensive, about $1000 per node. To make this easy, we would need a wiring-free, modular architecture. Picture a shelf with SATA-drive-size bays; each would get a compute node of 8 cores and 16G plus interconnect. To simplify the network, the module would have a laser on each side for a cube network topology; the enclosure could connect the edges with fiber optics for the 3D torus. Mass storage would be in the same form factor, as disks or flash to be interspersed in the proper ratio with compute nodes. All would have the same communication chip, something like a Cray SeaStar with 6 or 7 external ports. A seventh port could be used for a reduce tree. The rack would provide cooling on the shelves by circulating a coolant fluid. Reconfiguring and scaling would be a matter of adding shelves to the cabinet and laying out Lego bricks, blue for compute and red for storage. The network could achieve a latency for an arbitrary point-to-point message of no more than 10-20 microseconds by the right mix of cube and tree. This in a box of thousands of nodes.
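A rough sketch of the latency arithmetic behind the torus-plus-tree mix follows; the hop counts are topological distances only, and the per-hop latency is an assumed figure, not a measurement.

```python
# Hop-count arithmetic for a wrap-around 3D torus plus a reduce tree.
# The per-hop latency is an assumption for illustration, not a measured value.
import math

def torus_hops(a, b, dims):
    """Shortest hop count between nodes a and b in a wrap-around 3D torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def tree_hops(n_nodes, fanout=6):
    """Up to the root and back down in a balanced reduce tree."""
    return 2 * math.ceil(math.log(n_nodes, fanout))

dims = (10, 10, 10)                                    # a 1000-node cube
worst_torus = torus_hops((0, 0, 0), (5, 5, 5), dims)   # 15 hops corner to center
per_hop_us = 1.0                                       # assumed per-hop latency, microseconds
print(worst_torus * per_hop_us, "us worst-case torus path")
print(tree_hops(1000) * per_hop_us, "us via the reduce tree")
```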

This type of layout would accommodate web scale without needing rows and rows of rack cabinets.

The external interfaces for web requests and replies are no big deal after this, since the intra-cluster traffic is orders of magnitude higher than the result set traffic. After all, the end user does not want dumps of the database but highly refined and ranked results.

Software

We have previously talked about dynamic partitions a la Google Bigtable or Amazon Dynamo. These techniques are entirely fine and will serve for the universal knowledge store.

But what about query logic? OK, having a consistent map of partitions shared over tens of thousands of nodes is entirely possible. So is replication, and logic for using a spatially closer partition when multiple copies are known to exist. These things take some programming but there is nothing really new about them. They are a direct, straightforward extension of what we do for clustering right now.

But look to windward. How to run complex queries and inference on the platform outlined above? There are some features of RDF querying like same-as that can be easily parallelized, backward-chaining style: Just proceed with the value at hand and initiate the lookup of synonyms and let them get processed when they become available. Same for subclass and sub-property. We already do this, but could do it with more parallelism.

No matter what advances in architecture take place, I do not see a world where every user materializes the entailment of their own same-as-es and rules over a web scale data set. So, backward chaining approaches to inference must develop. Luckily, what most queries need is simple. A rule-oriented language, like Prolog without cut will parallelize well enough. Some degree of memoization may be appropriate for cutting down on re-proving the same thing over and over. Memoization over a cluster is a problem though, since this involves messages. I should say that one should not go looking for pre-proven things beyond the node at hand and that computation should not spread too quickly or too promiscuously so as not to make long message paths. We must remember that a totally even round trip time on a large cluster just will not happen.
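Here is a sketch of that backward-chaining pattern for same-as: start work on the value at hand, kick off synonym lookups in parallel, and fold results in as they arrive. The lookup_synonyms and process functions are hypothetical stand-ins, not Virtuoso APIs.

```python
# Sketch of backward-chaining expansion of owl:sameAs synonyms in parallel.
# lookup_synonyms and process are hypothetical stand-ins for illustration.
from concurrent.futures import ThreadPoolExecutor, as_completed

def lookup_synonyms(iri):
    # Stand-in: a real system would consult the sameAs index here.
    return {"http://example.org/A": ["http://example.org/A2"]}.get(iri, [])

def process(iri):
    return f"processed {iri}"

def expand_and_process(iri, pool):
    results = [process(iri)]                          # proceed with the value at hand
    futures = [pool.submit(process, s) for s in lookup_synonyms(iri)]
    for f in as_completed(futures):                   # synonyms as they become available
        results.append(f.result())
    return results

with ThreadPoolExecutor(max_workers=4) as pool:
    print(expand_and_process("http://example.org/A", pool))
```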

Query planning on any system critically depends on correct ballpark guesses on cardinality. If predicates are amplified with transitivity and same-as at run time, the cardinality guessing becomes harder. This can kill any plan no matter the hardware. Probably, an executing query must be annotated with the underlying cardinality assumptions. If these prove radically false, the execution may abort and a new plan be made to better match what was found out. This bears some looking into.
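A small sketch of that abort-and-replan idea: the running plan carries its cardinality assumptions, and when the observed counts deviate by a large factor, execution stops and a new plan is made with the observed numbers. The plan and operators here are hypothetical placeholders.

```python
# Sketch of annotating an executing plan with cardinality assumptions and
# replanning when they prove radically false. Plan/operators are placeholders.
class CardinalityMiss(Exception):
    pass

def run_operator(op, assumed_rows, tolerance=10):
    actual_rows = op()                     # run the operator, count its output
    if actual_rows > assumed_rows * tolerance:
        raise CardinalityMiss((assumed_rows, actual_rows))
    return actual_rows

def execute(plan):
    try:
        for op, assumed in plan:
            run_operator(op, assumed)
    except CardinalityMiss as miss:
        # Re-optimize with the observed cardinality and start over.
        print("replanning, assumed vs actual:", miss.args[0])

execute([(lambda: 50, 40), (lambda: 100000, 100)])   # second operator triggers a replan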

There are some network algorithms like shortest path, traveling salesman and similar that probably deserve a special operator in the query language. These can benefit from parallelization and a sequential implementation running on a cluster with latency will be a disaster. Expressing the message flow in a rule language is not really simple and pretty much no programmer will either appreciate the necessity or go to the trouble. Therefore such things should likely be offered by the platform and made by the few people who understand such matters.

For forward chaining, it seems that any results should generally go into their own graph so as not to pollute the base data. This graph, supposing it is small enough, can have different partitioning from the large base data set. If the data comes from far and wide but results are local, there is better use for a RETE-like algorithm for triggering inference when data comes in. RETE will parallelize well enough, also for clusters; results just have to be broadcast to the nodes that may have a use for them.

The programming model will typically be using a set of local overlays on a large shared data set. Queries will most often not be against a single graph. Strict transactionality will be the exception rather than the rule. At the database node level, there must be real transactionality even if the whole system most often did not run with strict ACID semantics. This is due to ACID requirements of some internal ops, e.g., some bitmap index operations, log checkpoints, DDL changes, etc.

For procedural tasks, map-reduce is OK. We have it even in our SQL. Map-reduce is not the basis for making a DBMS but it is a nice feature for some parts of query evaluation and application logic.

We have not talked about linking the data itself, but there is a whole workshop on this next week in Beijing; I will write about it separately. Let this just serve to state that we are serious about the platform for this, present and future.

Conclusion

The web scale database, the Google with arbitrary joining and inference, is one generation away, talking of economical implementation. Today, a price tag of $100K will buy some 50-100 billion triples worth with reasonable query response. An unlimited budget will take this a bit further, maybe one order of magnitude, and then returns might be diminishing.

Of course, this is worth nothing if the software isn't there. Virtuoso with its cluster edition will allow one to use unlimited RAM and processors. The frontier is now at getting just the right parallelism for each task, including inference ones.

04/14/2008 05:41 GMT-0500 Modified: 04/14/2008 13:19 GMT-0500
Virtuoso Cluster Preview

I wrote the basics of the Virtuoso clustering support over the past three weeks. It can now manage connections, decide where things go, do two-phase commits, and insert and select data from tables partitioned over multiple Virtuoso instances. It works well enough to be measured, and I will blog more about the measurements over the next two weeks.

I will in the following give a features preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007).

Data Partitioning

A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key. Blobs are stored on separate pages when they do not fit inline within the row. With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored. Virtuoso partitions by hash and allows specifying what parts of the partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string. This way, key compression gains are not lost by scattering consecutive values across different partitions.
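To make the partial-key idea concrete, here is a sketch, an illustration of the idea rather than Virtuoso's actual hash function: hashing only bits 14-6 of an integer, or the first 5 characters of a string, keeps runs of consecutive key values in one partition, so key compression still pays off.

```python
# Sketch of partitioning on parts of the key; the hash functions here are
# illustrative assumptions, not Virtuoso's actual partitioning hash.
import zlib

def int_partition_key(value, high_bit=14, low_bit=6):
    return (value >> low_bit) & ((1 << (high_bit - low_bit + 1)) - 1)

def str_partition_key(value, n_chars=5):
    return zlib.crc32(value[:n_chars].encode("utf-8"))

def node_for(key, n_nodes):
    return key % n_nodes

# Consecutive integers 1024..1087 share bits 14-6, hence the same node.
print({node_for(int_partition_key(v), 4) for v in range(1024, 1088)})
# Strings sharing a 5-character prefix also land on the same node.
print(node_for(str_partition_key("customer_000123"), 4),
      node_for(str_partition_key("customer_000999"), 4))
```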

Once the partitioning is specified, we specify which set of cluster nodes stores this index.  Not every index has to be split evenly across all nodes.  Also, all nodes do not have to have equal slices of the partitioned index, accommodating differences in capacity between cluster nodes.

Each Virtuoso instance can manage up to 32TB of data.  A cluster has no definite size limit.

Load Balancing and Fault Tolerance

When data is partitioned, an operation on the data goes where the data is.  This provides a certain natural parallelism but we will discuss this further below.

Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load.  Some data, such as database schema, is replicated on all nodes.  When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition.  If this is the case, updates go to all nodes and reads go to a randomly picked node from the group.

If one of the nodes in the group fails, operation can resume with the surviving node.  The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster but these are aborts as in the case of deadlock and lose no committed data.

Shared Nothing

The Virtuoso architecture does not require a SAN for disk sharing across nodes. This is reasonable, since a few disks on a local controller can easily provide 300 MB/s of reads; passing this over an interconnect fabric that also has to carry inter-node messages could saturate even a fast network.

Client View

A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics.  DDL operations like table creation and package installation are limited to one node, though.

Applications such as ODS will run unmodified.  They are installed on all nodes with a single install command.  After this, the data partitioning must be declared, which is a one time operation to be done cluster by cluster.  The only application change is specifying the partitioning columns for each index.  The gain is optional redundant storage and capacity not limited to a single machine.  The penalty is that single operations may take a little longer when not all data is managed by the same process but then the parallel throughput is increased.  We note that the main ODS performance factor is web page logic and not database access.  Thus splitting the web server logic over multiple nodes gives basically linear scaling.

Parallel Query Execution

Message latency is the principal performance factor in a clustered database. Due to this, Virtuoso packs the maximum number of operations into a single message. For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the inner-loop retrievals are run in parallel. So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency. If each step of the join gets 10 rows, for a total of 100,000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.
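A sketch of the batching idea: instead of one round trip per inner-loop lookup, group the outer rows by the node that owns each inner key and send one message per node per batch. The owner_node and send_batch functions are hypothetical stand-ins.

```python
# Sketch of packing many lookups into one message per node; owner_node and
# send_batch are hypothetical stand-ins for the partition map and the network.
from collections import defaultdict

def owner_node(key, n_nodes=4):
    return hash(key) % n_nodes

def send_batch(node, keys):
    # Stand-in for one network round trip carrying many lookups at once.
    return {k: f"row-for-{k}" for k in keys}

def batched_loop_join(outer_keys, n_nodes=4):
    by_node = defaultdict(list)
    for k in outer_keys:
        by_node[owner_node(k, n_nodes)].append(k)
    results = {}
    for node, keys in by_node.items():     # one message per node, not per row
        results.update(send_batch(node, keys))
    return [results[k] for k in outer_keys]

print(batched_loop_join(["a", "b", "c", "d", "e", "f"]))
```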

Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node.

Parallel SQL

There are SQL extensions for stored procedures allowing parallelizing operations.  For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned.  Transactional semantics are kept but error detection is deferred to the actual execution.

Transactions

Each transaction is owned by one node of the cluster, the node to which the client is connected. When more than one node besides the owner of the transaction is updated, two-phase commit is used. This is transparent to the application code. No external transaction monitor is required; the Virtuoso instances perform these functions internally. There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting information.
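To make the two-phase sequence concrete, here is a minimal sketch: the owning node asks every updated node to prepare, and only if all vote yes does it send commit; any refusal rolls everything back. The participant interface is hypothetical, not Virtuoso's internal protocol.

```python
# Minimal two-phase-commit sketch; the Participant interface is an assumption
# for illustration, not Virtuoso's internal protocol.
class Participant:
    def __init__(self, name, will_prepare=True):
        self.name, self.will_prepare, self.state = name, will_prepare, "active"

    def prepare(self):
        self.state = "prepared" if self.will_prepare else "aborted"
        return self.will_prepare

    def commit(self):   self.state = "committed"
    def rollback(self): self.state = "rolled back"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):      # phase 1: prepare / vote
        for p in participants: p.commit()           # phase 2: commit
        return True
    for p in participants: p.rollback()             # any "no" vote aborts all
    return False

nodes = [Participant("node1"), Participant("node2", will_prepare=False)]
print(two_phase_commit(nodes), [p.state for p in nodes])
```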

Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common.

Interconnect and Threading

Virtuoso uses TCP to connect between instances. A single instance can have multiple listeners on different network interfaces for cluster activity. The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface. Long messages, such as transfers of blobs, are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding.

We will have to test the performance of TCP over Infiniband to see if there is clear gain in going to a lower level interface like MPI.  The Virtuoso architecture is based on streams connecting cluster nodes point to point.  The design does not per se gain from remote DMA or other features provided by MPI.  Typically, messages are quite short, under 100K.  Flow control for transfer of blobs is however nice to have but can be written at the application level if needed.  We will get real data on the performance of different interconnects in the next weeks.

Deployment and Management

Configuring is quite simple, with each process sharing a copy of the same configuration file.  One line in the file differs from host to host, telling it which one it is.  Otherwise the database configuration files are individual per host, accommodating different file system layouts etc.  Setting up a node requires copying the executable and two configuration files, no more.   All functionality is contained in a single process.  There are no installers to be run or such.

Changing the number or network interface of cluster nodes requires a cluster restart.  Changing data partitioning requires copying the data into a new table and renaming this over the old one.  This is time consuming and does not mix well with updates.  Splitting an existing cluster node requires no copying with repartitioning but shifting data between partitions does.

A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes.

Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.

Present State and Next Developments

The basics are now in place.  Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc.  Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written.

The RDF capabilities are not specifically affected by clustering except in a couple of places.  Loading will be slightly revised to use larger batches of rows to minimize latency, for example.

There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins etc.  These will be added gradually.  These are however not really central to the first application of RDF storage but are quite important for business intelligence, for example.

We will run some benchmarks for comparing single host and clustered Virtuoso instances over the next weeks.  Some of this will be with real data, giving an estimate on when we can move some of the RDF data we presently host to the new platform.  We will benchmark against Oracle and DB2 later but first we get things to work and compare against ourselves.

We roughly expect a halving in space consumption and a significant increase in single query performance and linearly scaling parallel throughput through addition of cluster nodes.

The next update will be on this blog within two weeks.

08/27/2007 05:51 GMT-0500 Modified: 04/25/2008 11:59 GMT-0500
         