Details

OpenLink Software
Burlington, United States

Subscribe

Post Categories

Recent Articles

Community Member Blogs

Display Settings

articles per page.
order.

Translate

Showing posts in all categories RefreshRefresh
Comparing Virtuoso Performance on Different Processors [ Orri Erling ]

Over the years we have run Virtuoso on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.

Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.

The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.

All times reported are real times, starting from the start of the first client and ending with the completion of the last client.

Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.

Test Platform Load
(seconds)
Run
(seconds)
GHz / cores / threads
1 Amazon EC2 Extra Large
(4 virtual cores)
340 42 1.2 GHz? / 4 / 1
1 Amazon EC2 Extra Large
(4 virtual cores)
305 43.3 1.2 GHz? / 4 / 1
2 1 x dual-core AMD 5900 263 58.2 2.9 GHz / 2 / 1
3 2 x dual-core Xeon 5130 ("Woodcrest") 245 35.7 2.0 GHz / 4 / 1
4 2 x quad-core Xeon 5410 ("Harpertown") 237 18.0 2.33 GHz / 8 / 1
5 2 x quad-core Xeon 5520 ("Nehalem") 162 18.3 2.26 GHz / 8 / 2

We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances costs 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8G. The 3 Xeon configurations are Supermicro boards with 667MHz memory for the Xeon 5130 ("Woodcrest") and Xeon 5410 ("Harpertown"), and 800MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 ("Nehalem"), 72 GB RAM, and 8 x 500 GB SATA disks.

Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We'll see another time with faster memories.

The operating systems were various 64 bit Linux distributions.

We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.

We tried some RDF operations on the two last systems:

operation Harpertown Nehalem
Build text index for DBpedia 1080s 770s
Entity Rank iteration 263s 251s

Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.

For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.

Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.

  1. On the host OS of the Nehalem system —
    n cpu% rows per second
    1 query 503 906,413
    2 queries 1263 1,578,585
    3 queries 1204 1,566,849
  2. In a VM under Xen, on the Nehalem system —
    n cpu% rows per second
    1 query 652 799,293
    2 queries 1266 1,486,710
    3 queries 1222 1,484,093
  3. On the host OS of the Harpertown system —
    n cpu% rows per second
    1 query 648 1,041,448
    2 queries 708 1,124,866

The CPU percentages are as reported by the OS: user + system CPU divided by real time.

So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.

We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, also tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.

The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load.

The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got was at 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory, else the best case will not be over 50% over a 667 MHz Core 2-based Xeon.

Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.

If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.

# PermaLink Comments [0]
05/28/2009 10:54 GMT Modified: 05/28/2009 11:15 GMT
Comparing Virtuoso Performance on Different Processors [ Virtuso Data Space Bot ]

Over the years we have run Virtuoso on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.

Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.

The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.

All times reported are real times, starting from the start of the first client and ending with the completion of the last client.

Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.

Test Platform Load
(seconds)
Run
(seconds)
GHz / cores / threads
1 Amazon EC2 Extra Large
(4 virtual cores)
340 42 1.2 GHz? / 4 / 1
1 Amazon EC2 Extra Large
(4 virtual cores)
305 43.3 1.2 GHz? / 4 / 1
2 1 x dual-core AMD 5900 263 58.2 2.9 GHz / 2 / 1
3 2 x dual-core Xeon 5130 ("Woodcrest") 245 35.7 2.0 GHz / 4 / 1
4 2 x quad-core Xeon 5410 ("Harpertown") 237 18.0 2.33 GHz / 8 / 1
5 2 x quad-core Xeon 5520 ("Nehalem") 162 18.3 2.26 GHz / 8 / 2

We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances costs 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8G. The 3 Xeon configurations are Supermicro boards with 667MHz memory for the Xeon 5130 ("Woodcrest") and Xeon 5410 ("Harpertown"), and 800MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 ("Nehalem"), 72 GB RAM, and 8 x 500 GB SATA disks.

Caveat: Due to slow memory (we could not get faster within available time), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., memory subsystem. We'll see another time with faster memories.

The operating systems were various 64 bit Linux distributions.

We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.

We tried some RDF operations on the two last systems:

operation Harpertown Nehalem
Build text index for DBpedia 1080s 770s
Entity Rank iteration 263s 251s

Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.

For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.

Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.

  1. On the host OS of the Nehalem system —
    n cpu% rows per second
    1 query 503 906,413
    2 queries 1263 1,578,585
    3 queries 1204 1,566,849
  2. In a VM under Xen, on the Nehalem system —
    n cpu% rows per second
    1 query 652 799,293
    2 queries 1266 1,486,710
    3 queries 1222 1,484,093
  3. On the host OS of the Harpertown system —
    n cpu% rows per second
    1 query 648 1,041,448
    2 queries 708 1,124,866

The CPU percentages are as reported by the OS: user + system CPU divided by real time.

So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.

We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, also tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.

The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load.

The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got was at 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it is with 1333 MHz memory, else the best case will not be over 50% over a 667 MHz Core 2-based Xeon.

Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.

If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.

# PermaLink Comments [0]
05/28/2009 10:54 GMT Modified: 05/28/2009 11:15 GMT
De Paradigmata and The Foundational Issues [ Virtuso Data Space Bot ]
De Paradigmata and The Foundational Issues

I thought that we had talked ourselves to exhaustion and beyond over the issue of the semantic web layer cake. Apparently not. There was a paper called Functional Architecture for the Semantic Web by Aurona Gerber et al at ESWC2008.

The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.

Any diagram containing the boxes of the layer cake (i.e., URI, XML, SPARQL, OWL, RIF, Crypto, etc., etc.) in whatever order or arrangement can at best be a sort of overview of how these standards reference each other.

Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites for producing motion and secondly links to Newton's laws of motion and to aerodynamics.

Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve.

Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.

Such a thing cannot possibly be inferred from the standards' inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low data volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.

We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.

The even-more foundational issue is the promotion of the general concept of a Web of Data.

The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. Or it is better to say that the first picture shown should stress this idea and then one could leave the layer cake, in whatever version, for explaining the standards' order of evolution or inter-reference.

So, the value proposition:

Why? Explosion of data volume, increased need of keeping up-to-date, increasing opportunity cost of not keeping in real time.

What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.

How? URI everything and everything is cool, or, give things global names. Use RDF. Reuse names or ontologies where can. (An ontology is a set of classes and property names plus some more.) Map relational data on the fly or store as RDF, whichever works. Query with SPARQL, easier than SQL.

So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.

# PermaLink Comments [0]
06/09/2008 10:02 GMT Modified: 06/11/2008 15:54 GMT
De Paradigmata and The Foundational Issues [ Orri Erling ]

I thought that we had talked ourselves to exhaustion and beyond over the issue of the semantic web layer cake. Apparently not. There was a paper called Functional Architecture for the Semantic Web by Aurona Gerber et al at ESWC2008.

The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.

Any diagram containing the boxes of the layer cake (i.e., URI, XML, SPARQL, OWL, RIF, Crypto, etc., etc.) in whatever order or arrangement can at best be a sort of overview of how these standards reference each other.

Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites for producing motion and secondly links to Newton's laws of motion and to aerodynamics.

Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve.

Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.

Such a thing cannot possibly be inferred from the standards' inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low data volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.

We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.

The even-more foundational issue is the promotion of the general concept of a Web of Data.

The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. Or it is better to say that the first picture shown should stress this idea and then one could leave the layer cake, in whatever version, for explaining the standards' order of evolution or inter-reference.

So, the value proposition:

Why? Explosion of data volume, increased need of keeping up-to-date, increasing opportunity cost of not keeping in real time.

What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.

How? URI everything and everything is cool, or, give things global names. Use RDF. Reuse names or ontologies where can. (An ontology is a set of classes and property names plus some more.) Map relational data on the fly or store as RDF, whichever works. Query with SPARQL, easier than SQL.

So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.

# PermaLink Comments [0]
06/09/2008 14:00 GMT Modified: 06/11/2008 15:54 GMT
WWW 2008 [ Virtuso Data Space Bot ]
WWW 2008

Following my return from WWW 2008 in Beijing, I will write a series of blog posts discussing diverse topics that were brought up in presentations and conversations during the week.

Linked data was our main interest in the conference and there was a one day workshop on this, unfortunately overlapping with a day of W3C Advisory Committee meetings. Hence Tim Berners-Lee, one of the chairs of the workshop, could not attend for most of the day. Still, he was present to say that "Linked open data is the semantic web and the web done as it ought to be done."

For my part, I will draw some architecture conclusions from the different talks and extrapolate about the requirements on database platforms for linked data.

Chris Bizer predicted that 2008 would be the year of data web search, if 2007 was the year of SPARQL. This may be the case, as linked data is now pretty much a reality and the questions of discovery become prevalent. There was a birds-of-a-feather session on this and I will make some comments on what we intend to explore in bridging between the text index based semantic web search engines and SPARQL.

Andy Seaborne convened a birds-of-a-feather session on the future of SPARQL. Many of the already anticipated and implemented requirements were confirmed and a few were introduced. A separate blog post will discuss these further.

From the various discussions held throughout the conference, we conclude that plug-and-play operation with the major semantic web frameworks of Jena, Sesame, and Redland, is our major immediate-term deliverable. Our efforts in this direction thus far are insufficient and we will next have these done with the right supervision and proper interop testing. The issues are fortunately simple but doing things totally right require some small server side support and some JDBC/ODBC tweaks, so to the interested, we advise to wait for an update to be published on this blog.

I further had a conversation with Andy Seaborne about using Jena reasoning capabilities with Virtuoso and generally the issues of "impedance mismatch" between reasoning and typical database workloads. More on this later.

# PermaLink Comments [0]
04/29/2008 10:37 GMT Modified: 04/29/2008 13:35 GMT
WWW 2008 [ Orri Erling ]

Following my return from WWW 2008 in Beijing, I will write a series of blog posts discussing diverse topics that were brought up in presentations and conversations during the week.

Linked data was our main interest in the conference and there was a one day workshop on this, unfortunately overlapping with a day of W3C Advisory Committee meetings. Hence Tim Berners-Lee, one of the chairs of the workshop, could not attend for most of the day. Still, he was present to say that "Linked open data is the semantic web and the web done as it ought to be done."

For my part, I will draw some architecture conclusions from the different talks and extrapolate about the requirements on database platforms for linked data.

Chris Bizer predicted that 2008 would be the year of data web search, if 2007 was the year of SPARQL. This may be the case, as linked data is now pretty much a reality and the questions of discovery become prevalent. There was a birds-of-a-feather session on this and I will make some comments on what we intend to explore in bridging between the text index based semantic web search engines and SPARQL.

Andy Seaborne convened a birds-of-a-feather session on the future of SPARQL. Many of the already anticipated and implemented requirements were confirmed and a few were introduced. A separate blog post will discuss these further.

From the various discussions held throughout the conference, we conclude that plug-and-play operation with the major semantic web frameworks of Jena, Sesame, and Redland, is our major immediate-term deliverable. Our efforts in this direction thus far are insufficient and we will next have these done with the right supervision and proper interop testing. The issues are fortunately simple but doing things totally right require some small server side support and some JDBC/ODBC tweaks, so to the interested, we advise to wait for an update to be published on this blog.

I further had a conversation with Andy Seaborne about using Jena reasoning capabilities with Virtuoso and generally the issues of "impedance mismatch" between reasoning and typical database workloads. More on this later.

# PermaLink Comments [0]
04/29/2008 11:59 GMT Modified: 04/29/2008 13:35 GMT
Architectures for Infinity [ Virtuso Data Space Bot ]
Architectures for Infinity

A while back, a friend suggested he and I go check out the Singularity Summit, a conference where they talk of strong AI. Well, I am not a singularist. But since singularists are so brazenly provocative, I looked a little to see if there is anything there, engineering-wise.

So, for a change, we'll be visionary. Read on, I also mention RDF at the end.

I will not even begin with arguments about indefinitely continuing a trend. I will just say that nature has continuous and discontinuous features. Computing is at a qualitative bend and things are not as simple as they are blithely cracked up to be. Read HEC, the high end crusader; he has good commentary on architecture. Having absorbed and understood this, you can talk again about a billion-fold increase in computing price/performance.

When I looked further, about uploading, i.e., the sci-fi process of scanning a brain and having it continue its function inside a computer simulation, I actually found some serious work. Dharmendra Modha at IBM had made a rat-scale cortical simulator running on a 32K node IBM BlueGene, with a 9x slowdown from real time. This is real stuff and in no way meant to be associated with singularism by being mentioned in the same paragraph. Anyway, I was intrigued.

So I asked myself if I could do better. I gave it a fair try, as fair as can be without an actual experiment. The end result is that latency rules. To have deterministic semantics, one must sync across tens of thousands of processors and one cannot entirely eliminate multi-hop messages on the interconnect fabric. If one optimistically precalculates and spreads optimistic results over the cluster, rolling them back will be expensive. Local optimistic computation may have little point since most data points will directly depend on non local data. One needs two sets of wiring, a 3D torus for predominantly close range bulk and a tree for sync, just like the BlueGene has. Making the same from newer hardware makes it more economical but there's another 8 or so orders of magnitude to go before energy efficiency parity with biology. Anyway, a many-times-over qualitative gap. Human scale in real time might be just there somewhere in reach with the stuff now in pre-release if there were no bounds on cost or power consumption. We'd be talking billions at least and even then it is iffy. But this is of little relevance as long as the rat scale is at the level of a systems test for the platform and not actually simulating a plausible organism in some sort of virtual environment. Anyway, of biology I cannot judge and for the computer science, Modha et al have it figured out about as well as can.

Simulation workloads are not the same as database. Database is easier in a way since any global sync is very rare. 2PC seldom touches every partition and if it does, the time to update was anyway greater than the time to commit. Databases are generally multi-user, with a low level of sync between users and a lot of asynchrony. On the other hand, the general case of database does not have predictable cluster locality. Well, if one has an OLTP app with a controlled set of transactions and reports, then one can partition just for that and have almost 100% affinity between the host serving the connection and the data. With RDF for example, such is not generally possible or would require such acrobatics of data layout that a DBA would not even begin.

So, for database, there is a pretty much even probability for any connection between the node running the query and any other. True, function shipping can make the messages large and fairly async and latency tolerant.

On the simulation side, it would seem that the wiring can mimic locality of the simulated process. Messages to neighbors are more likely than messages to remote nodes. So a 3D torus works well there, complemented by a tree for control messages, like sending all nodes the count of messages to expect. Of course, the control wiring (reduce tree) must have far less latency than a long path through the torus wiring and steps with long message deliveries on the torus must be rare for this to bring any profit. Also, long distance messages can go through the tree wiring if the volume is not excessive, else the tree top gets congested.

So, looking past the multicore/multithread and a single level switched interconnect, what would the architecture of the total knowledge machine be? For the neural simulation, the above-described is the best I can come up with and IBM already has come up with it anyway. But here I am more concerned with a database/symbol processing workload than about physical simulation. For the record, I'll say that I expect no strong AI to emerge from these pursuits but that they will still be useful, basically as a support for a linked data "planetary datasphere." Like Google, except it supports arbitrary queries joining arbitrary data at web scale. This would be a couple of orders of magnitude above the text index of same. Now, in practice, most queries will be rather trivial, so it is not that the 2 orders of magnitude are always realized. This would also involve a bit more updating than the equivalent text index since reports would now and then include private materialized inference results. As a general rule, backward chaining would be nicer, since this is read only but some private workspaces for distilling data cannot be avoided.

So, given this general spec, where should architecture go? We can talk of processors, networks and software in turn.

Processors

For evaluating processors, the archetypal task is doing a single random lookup out of a big index. Whether it is a B tree or hash is not really essential since both will have to be built of fixed size pages and will have more than one level. Both need some serialization for checking if things are in memory and for pinning them in cache for the duration of processing. This is eternally true as long as updates are permitted, even if RAM were the bottom of the hierarchy. And if not, then there is the checking of whether a page is in memory and the logic for pinning it.

This means critical sections on shared data structures with small memory writes inside. This is anathema, yet true. So, what the processor needs is shared memory for threads and if possible an instruction for entry into a read-write lock. If the lock is busy, there should be a settable interval before the thread is removed from the core. With a multithread core, this should be just like a memory cache miss. Only if this were really prolonged would there be an interrupt to run the OS scheduler to vacate the thread and load something else, and not even this if the number of executable threads were less or equal the number of threads on the cores.

For thread sync structures, the transactional memory ideas on the SPARC ROC might be going in this direction. For on-core threading, I have not tried how well SPARC T2 does but with the right OS support, it might be in the ballpark. The X86_64 chips have nice thread speed but the OS, whether Solaris or Linux, is a disaster if a mutex is busy. Don't know why but so it is.

Things like IBM Cell, with multiple integrated distributed memory processors on a chip might be workable if they had hardware for global critical sections and if sub-microsecond tasks could be profitably dispatched to the specialized cores (synergistic unit in Cell terminology). If an index lookup, about 2-4 microseconds of real time, could be profitably carried out on a synergistic core without sinking it all in sync delays on the main core, there might be some gain to this. Still, operations like cache lookups tend to be pointer chasing and latency of small memory reads and sometimes writes is a big factor. I have not measured Cell for this but it is not advertised for this sort of workload. It might do nicely for the neuron simulator but not for generic database, I would guess.

Networks and Data Center

The point at which distributed memory wins over shared determines the size of the single compute node. Problems of threads and critical sections fall off but are replaced by network and the troubles of scheduling and serializing and de-serializing messages. There are really huge shared memory systems (by Cray, for example) but even there, hitting the wrong address sends a high latency message over an internal switched fabric, a bit like a network page fault. Well, if one has millions of threads runnable for catching the slack of memory access over interconnect, this might not be so bad, but this is not really the architecture for a general purpose database. So, at some point, we have clearly demarcated distributed memory with affinity of data to processor, simple as that.

For now, the high end clustered database benchmarks run off 1 Gbit ethernet fabrics with some 30 compute nodes on a single switch. This is for shared nothing systems. For shared disk, cache fusion systems like Oracle RAC, we have more heavy duty networking like Infiniband, as one would expect. I have discussed the merits of cache fusion vs shared nothing partitioning on in a previous post.

As long as we are at a scale that fits on a single switch with even port to port latency, we are set. For an RDF workload, throughput is not really the issue but latency can be. With today's technology, nodes with 4-8 cores and 16G RAM are practical and their number is most often not really large. Adding two orders of magnitude, we get more questions. Let's say that 2 billion triples fit with relative comfort but not without disk on a 16G RAM node. This would make 500 nodes for a trillion triples.

This is an entirely relevant and reasonable scale, considering that loading all public biomedical sets plus a pharma company's in house data could approach this ballpark. Let alone anything on the scale of the social web, i.e., the online conversation space.

So how to manage clusters like this? The current cluster gear is oriented toward switch trees and is fairly expensive, about $1000 per node. To make this easy, we would need a wiring free, modular architecture. Picture a shelf with SATA drive size bays, each would get a compute node of 8 cores and 16G plus interconnect. To simplify the network, the module would have a laser on each side, for a cube network topology, the enclosure could connect the edges with fiber optics for the 3D torus. Mass storage would be in the same form factor, as disks or flashes to be interspersed in the proper ratio with compute nodes. All would have the same communication chip, something like a Cray SeaStar with 6 or 7 external ports. A seventh port could be used for a reduce tree. The rack would provide cooling on the shelves by circulating a coolant fluid. Reconfiguring and scaling would be a matter of adding shelves to the cabinet and laying out Lego bricks, blue for compute and red for storage. The network could achieve a latency for an arbitrary point to point message of no more than 10-20 microseconds by the right mix of cube and tree. This in a box of thousands of nodes.

This type of layout would accommodate web scale without needing rows and rows of rack cabinets.

The external interfaces for web requests and replies are no big deal after this, since the intra-cluster traffic is orders of magnitude higher than the result set traffic. After all, the end user does not want dumps of the database but highly refined and ranked results.

Software

We have previously talked about dynamic partitions a la Google Bigtable or Amazon Dynamo. These techniques are entirely fine and will serve for the universal knowledge store.

But what about query logic? OK, having a consistent map of partitions shared over tens of thousands of nodes is entirely possible. So is replication and logic for using a spatially closer partition when multiple copies are know to exist. These things take some programming but there is nothing really new about them. These are a direct straightforward extension of what we do for clustering right now.

But look to windward. How to run complex queries and inference on the platform outlined above? There are some features of RDF querying like same-as that can be easily parallelized, backward-chaining style: Just proceed with the value at hand and initiate the lookup of synonyms and let them get processed when they become available. Same for subclass and sub-property. We already do this, but could do it with more parallelism.

No matter what advances in architecture take place, I do not see a world where every user materializes the entailment of their own same-as-es and rules over a web scale data set. So, backward chaining approaches to inference must develop. Luckily, what most queries need is simple. A rule-oriented language, like Prolog without cut will parallelize well enough. Some degree of memoization may be appropriate for cutting down on re-proving the same thing over and over. Memoization over a cluster is a problem though, since this involves messages. I should say that one should not go looking for pre-proven things beyond the node at hand and that computation should not spread too quickly or too promiscuously so as not to make long message paths. We must remember that a totally even round trip time on a large cluster just will not happen.

Query planning on any system critically depends on correct ballpark guesses on cardinality. If predicates are amplified with transitivity and same-as at run time, the cardinality guessing becomes harder. This can kill any plan no matter the hardware. Probably, an executing query must be annotated with the underlying cardinality assumptions. If these prove radically false, the execution may abort and a new plan be made to better match what was found out. This bears some looking into.

There are some network algorithms like shortest path, traveling salesman and similar that probably deserve a special operator in the query language. These can benefit from parallelization and a sequential implementation running on a cluster with latency will be a disaster. Expressing the message flow in a rule language is not really simple and pretty much no programmer will either appreciate the necessity or go to the trouble. Therefore such things should likely be offered by the platform and made by the few people who understand such matters.

For forward chaining, it seems that any results should generally go into their own graph so as not to pollute the base data. This graph, supposing it is small enough, can have different partitioning from the large base data set. If the data comes from far and wide but results are local, there is better usage of a RETE like algorithm for triggering inference when data comes in. RETE will parallelize well enough, also for clusters,results just have to be broadcast to the nodes that may have a use for them.

The programming model will typically be using a set of local overlays on a large shared data set. Queries will most often not be against a single graph. Strict transactionality will be the exception rather than the rule. At the database node level, there must be real transactionality even if the whole system most often did not run with strict ACID semantics. This is due to ACID requirements of some internal ops, e.g., some bitmap index operations, log checkpoints, DDL changes, etc.

For procedural tasks, map-reduce is OK. We have it even in our SQL. Map-reduce is not the basis for making a DBMS but it is a nice feature for some parts of query evaluation and application logic.

We have not talked about linking the data itself, but there is a whole workshop on this next week in Beijing; I will write about it separately. Let this just serve to state that we are serious about the platform for this, present and future.

Conclusion

The web scale database, the Google with arbitrary joining and inference, is one generation away, talking of economical implementation. Today, a price tag of $100K will buy some 50-100 billion triples worth with reasonable query response. Unlimited budget will take this a bit further, like one order of magnitude. and then, returns might be decreasing.

Of course, this is worth nothing if the software isn't there. Virtuoso with its cluster edition will allow one to use unlimited RAM and processors. The frontier is now at getting just the right parallelism for each task, including inference ones.

# PermaLink Comments [0]
04/14/2008 05:41 GMT Modified: 04/14/2008 13:19 GMT
Architectures for Infinity [ Orri Erling ]

A while back, a friend suggested he and I go check out the Singularity Summit, a conference where they talk of strong AI. Well, I am not a singularist. But since singularists are so brazenly provocative, I looked a little to see if there is anything there, engineering-wise.

So, for a change, we'll be visionary. Read on, I also mention RDF at the end.

I will not even begin with arguments about indefinitely continuing a trend. I will just say that nature has continuous and discontinuous features. Computing is at a qualitative bend and things are not as simple as they are blithely cracked up to be. Read HEC, the high end crusader; he has good commentary on architecture. Having absorbed and understood this, you can talk again about a billion-fold increase in computing price/performance.

When I looked further, about uploading, i.e., the sci-fi process of scanning a brain and having it continue its function inside a computer simulation, I actually found some serious work. Dharmendra Modha at IBM had made a rat-scale cortical simulator running on a 32K node IBM BlueGene, with a 9x slowdown from real time. This is real stuff and in no way meant to be associated with singularism by being mentioned in the same paragraph. Anyway, I was intrigued.

So I asked myself if I could do better. I gave it a fair try, as fair as can be without an actual experiment. The end result is that latency rules. To have deterministic semantics, one must sync across tens of thousands of processors and one cannot entirely eliminate multi-hop messages on the interconnect fabric. If one optimistically precalculates and spreads optimistic results over the cluster, rolling them back will be expensive. Local optimistic computation may have little point since most data points will directly depend on non local data. One needs two sets of wiring, a 3D torus for predominantly close range bulk and a tree for sync, just like the BlueGene has. Making the same from newer hardware makes it more economical but there's another 8 or so orders of magnitude to go before energy efficiency parity with biology. Anyway, a many-times-over qualitative gap. Human scale in real time might be just there somewhere in reach with the stuff now in pre-release if there were no bounds on cost or power consumption. We'd be talking billions at least and even then it is iffy. But this is of little relevance as long as the rat scale is at the level of a systems test for the platform and not actually simulating a plausible organism in some sort of virtual environment. Anyway, of biology I cannot judge and for the computer science, Modha et al have it figured out about as well as can.

Simulation workloads are not the same as database. Database is easier in a way since any global sync is very rare. 2PC seldom touches every partition and if it does, the time to update was anyway greater than the time to commit. Databases are generally multi-user, with a low level of sync between users and a lot of asynchrony. On the other hand, the general case of database does not have predictable cluster locality. Well, if one has an OLTP app with a controlled set of transactions and reports, then one can partition just for that and have almost 100% affinity between the host serving the connection and the data. With RDF for example, such is not generally possible or would require such acrobatics of data layout that a DBA would not even begin.

So, for database, there is a pretty much even probability for any connection between the node running the query and any other. True, function shipping can make the messages large and fairly async and latency tolerant.

On the simulation side, it would seem that the wiring can mimic locality of the simulated process. Messages to neighbors are more likely than messages to remote nodes. So a 3D torus works well there, complemented by a tree for control messages, like sending all nodes the count of messages to expect. Of course, the control wiring (reduce tree) must have far less latency than a long path through the torus wiring and steps with long message deliveries on the torus must be rare for this to bring any profit. Also, long distance messages can go through the tree wiring if the volume is not excessive, else the tree top gets congested.

So, looking past the multicore/multithread and a single level switched interconnect, what would the architecture of the total knowledge machine be? For the neural simulation, the above-described is the best I can come up with and IBM already has come up with it anyway. But here I am more concerned with a database/symbol processing workload than about physical simulation. For the record, I'll say that I expect no strong AI to emerge from these pursuits but that they will still be useful, basically as a support for a linked data "planetary datasphere." Like Google, except it supports arbitrary queries joining arbitrary data at web scale. This would be a couple of orders of magnitude above the text index of same. Now, in practice, most queries will be rather trivial, so it is not that the 2 orders of magnitude are always realized. This would also involve a bit more updating than the equivalent text index since reports would now and then include private materialized inference results. As a general rule, backward chaining would be nicer, since this is read only but some private workspaces for distilling data cannot be avoided.

So, given this general spec, where should architecture go? We can talk of processors, networks and software in turn.

Processors

For evaluating processors, the archetypal task is doing a single random lookup out of a big index. Whether it is a B tree or hash is not really essential since both will have to be built of fixed size pages and will have more than one level. Both need some serialization for checking if things are in memory and for pinning them in cache for the duration of processing. This is eternally true as long as updates are permitted, even if RAM were the bottom of the hierarchy. And if not, then there is the checking of whether a page is in memory and the logic for pinning it.

This means critical sections on shared data structures with small memory writes inside. This is anathema, yet true. So, what the processor needs is shared memory for threads and if possible an instruction for entry into a read-write lock. If the lock is busy, there should be a settable interval before the thread is removed from the core. With a multithread core, this should be just like a memory cache miss. Only if this were really prolonged would there be an interrupt to run the OS scheduler to vacate the thread and load something else, and not even this if the number of executable threads were less or equal the number of threads on the cores.

For thread sync structures, the transactional memory ideas on the SPARC ROC might be going in this direction. For on-core threading, I have not tried how well SPARC T2 does but with the right OS support, it might be in the ballpark. The X86_64 chips have nice thread speed but the OS, whether Solaris or Linux, is a disaster if a mutex is busy. Don't know why but so it is.

Things like IBM Cell, with multiple integrated distributed memory processors on a chip might be workable if they had hardware for global critical sections and if sub-microsecond tasks could be profitably dispatched to the specialized cores (synergistic unit in Cell terminology). If an index lookup, about 2-4 microseconds of real time, could be profitably carried out on a synergistic core without sinking it all in sync delays on the main core, there might be some gain to this. Still, operations like cache lookups tend to be pointer chasing and latency of small memory reads and sometimes writes is a big factor. I have not measured Cell for this but it is not advertised for this sort of workload. It might do nicely for the neuron simulator but not for generic database, I would guess.

Networks and Data Center

The point at which distributed memory wins over shared determines the size of the single compute node. Problems of threads and critical sections fall off but are replaced by network and the troubles of scheduling and serializing and de-serializing messages. There are really huge shared memory systems (by Cray, for example) but even there, hitting the wrong address sends a high latency message over an internal switched fabric, a bit like a network page fault. Well, if one has millions of threads runnable for catching the slack of memory access over interconnect, this might not be so bad, but this is not really the architecture for a general purpose database. So, at some point, we have clearly demarcated distributed memory with affinity of data to processor, simple as that.

For now, the high end clustered database benchmarks run off 1 Gbit ethernet fabrics with some 30 compute nodes on a single switch. This is for shared nothing systems. For shared disk, cache fusion systems like Oracle RAC, we have more heavy duty networking like Infiniband, as one would expect. I have discussed the merits of cache fusion vs shared nothing partitioning on in a previous post.

As long as we are at a scale that fits on a single switch with even port to port latency, we are set. For an RDF workload, throughput is not really the issue but latency can be. With today's technology, nodes with 4-8 cores and 16G RAM are practical and their number is most often not really large. Adding two orders of magnitude, we get more questions. Let's say that 2 billion triples fit with relative comfort but not without disk on a 16G RAM node. This would make 500 nodes for a trillion triples.

This is an entirely relevant and reasonable scale, considering that loading all public biomedical sets plus a pharma company's in house data could approach this ballpark. Let alone anything on the scale of the social web, i.e., the online conversation space.

So how to manage clusters like this? The current cluster gear is oriented toward switch trees and is fairly expensive, about $1000 per node. To make this easy, we would need a wiring free, modular architecture. Picture a shelf with SATA drive size bays, each would get a compute node of 8 cores and 16G plus interconnect. To simplify the network, the module would have a laser on each side, for a cube network topology, the enclosure could connect the edges with fiber optics for the 3D torus. Mass storage would be in the same form factor, as disks or flashes to be interspersed in the proper ratio with compute nodes. All would have the same communication chip, something like a Cray SeaStar with 6 or 7 external ports. A seventh port could be used for a reduce tree. The rack would provide cooling on the shelves by circulating a coolant fluid. Reconfiguring and scaling would be a matter of adding shelves to the cabinet and laying out Lego bricks, blue for compute and red for storage. The network could achieve a latency for an arbitrary point to point message of no more than 10-20 microseconds by the right mix of cube and tree. This in a box of thousands of nodes.

This type of layout would accommodate web scale without needing rows and rows of rack cabinets.

The external interfaces for web requests and replies are no big deal after this, since the intra-cluster traffic is orders of magnitude higher than the result set traffic. After all, the end user does not want dumps of the database but highly refined and ranked results.

Software

We have previously talked about dynamic partitions a la Google Bigtable or Amazon Dynamo. These techniques are entirely fine and will serve for the universal knowledge store.

But what about query logic? OK, having a consistent map of partitions shared over tens of thousands of nodes is entirely possible. So is replication and logic for using a spatially closer partition when multiple copies are know to exist. These things take some programming but there is nothing really new about them. These are a direct straightforward extension of what we do for clustering right now.

But look to windward. How to run complex queries and inference on the platform outlined above? There are some features of RDF querying like same-as that can be easily parallelized, backward-chaining style: Just proceed with the value at hand and initiate the lookup of synonyms and let them get processed when they become available. Same for subclass and sub-property. We already do this, but could do it with more parallelism.

No matter what advances in architecture take place, I do not see a world where every user materializes the entailment of their own same-as-es and rules over a web scale data set. So, backward chaining approaches to inference must develop. Luckily, what most queries need is simple. A rule-oriented language, like Prolog without cut will parallelize well enough. Some degree of memoization may be appropriate for cutting down on re-proving the same thing over and over. Memoization over a cluster is a problem though, since this involves messages. I should say that one should not go looking for pre-proven things beyond the node at hand and that computation should not spread too quickly or too promiscuously so as not to make long message paths. We must remember that a totally even round trip time on a large cluster just will not happen.

Query planning on any system critically depends on correct ballpark guesses on cardinality. If predicates are amplified with transitivity and same-as at run time, the cardinality guessing becomes harder. This can kill any plan no matter the hardware. Probably, an executing query must be annotated with the underlying cardinality assumptions. If these prove radically false, the execution may abort and a new plan be made to better match what was found out. This bears some looking into.

There are some network algorithms like shortest path, traveling salesman and similar that probably deserve a special operator in the query language. These can benefit from parallelization and a sequential implementation running on a cluster with latency will be a disaster. Expressing the message flow in a rule language is not really simple and pretty much no programmer will either appreciate the necessity or go to the trouble. Therefore such things should likely be offered by the platform and made by the few people who understand such matters.

For forward chaining, it seems that any results should generally go into their own graph so as not to pollute the base data. This graph, supposing it is small enough, can have different partitioning from the large base data set. If the data comes from far and wide but results are local, there is better usage of a RETE like algorithm for triggering inference when data comes in. RETE will parallelize well enough, also for clusters,results just have to be broadcast to the nodes that may have a use for them.

The programming model will typically be using a set of local overlays on a large shared data set. Queries will most often not be against a single graph. Strict transactionality will be the exception rather than the rule. At the database node level, there must be real transactionality even if the whole system most often did not run with strict ACID semantics. This is due to ACID requirements of some internal ops, e.g., some bitmap index operations, log checkpoints, DDL changes, etc.

For procedural tasks, map-reduce is OK. We have it even in our SQL. Map-reduce is not the basis for making a DBMS but it is a nice feature for some parts of query evaluation and application logic.

We have not talked about linking the data itself, but there is a whole workshop on this next week in Beijing; I will write about it separately. Let this just serve to state that we are serious about the platform for this, present and future.

Conclusion

The web scale database, the Google with arbitrary joining and inference, is one generation away, talking of economical implementation. Today, a price tag of $100K will buy some 50-100 billion triples worth with reasonable query response. Unlimited budget will take this a bit further, like one order of magnitude. and then, returns might be decreasing.

Of course, this is worth nothing if the software isn't there. Virtuoso with its cluster edition will allow one to use unlimited RAM and processors. The frontier is now at getting just the right parallelism for each task, including inference ones.

# PermaLink Comments [0]
04/14/2008 09:32 GMT Modified: 04/14/2008 13:19 GMT
Virtuoso Cluster Preview [ Virtuso Data Space Bot ]
Virtuoso Cluster Preview

I wrote the basics of the Virtuoso clustering support over the past three weeks.  It can now manage connections, decide where things go, do two phase commits, insert and select data from tables partitioned over multiple Virtuoso instances.  It works about enough to be measured, of which I will blog more over the next two weeks.

I will in the following give a features preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007).

Data Partitioning

A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key.  Blobs are stored on separate pages when they do not fit inline within the row.  With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored.  Virtuoso partitions by hash and allows specifying what parts of partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string.  Like this, key compression gains are not lost by storing consecutive values on different partitions.

Once the partitioning is specified, we specify which set of cluster nodes stores this index.  Not every index has to be split evenly across all nodes.  Also, all nodes do not have to have equal slices of the partitioned index, accommodating differences in capacity between cluster nodes.

Each Virtuoso instance can manage up to 32TB of data.  A cluster has no definite size limit.

Load Balancing and Fault Tolerance

When data is partitioned, an operation on the data goes where the data is.  This provides a certain natural parallelism but we will discuss this further below.

Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load.  Some data, such as database schema, is replicated on all nodes.  When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition.  If this is the case, updates go to all nodes and reads go to a randomly picked node from the group.

If one of the nodes in the group fails, operation can resume with the surviving node.  The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster but these are aborts as in the case of deadlock and lose no committed data.

Shared Nothing

The Virtuoso architecture does not require a SAN for disk sharing across nodes.  This is reasonable since a few disks on a local controller can easily provide 300MB/s of read and passing this over an interconnect fabric that would also have to carry inter-node messages could saturate even a fast network.

Client View

A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics.  DDL operations like table creation and package installation are limited to one node, though.

Applications such as ODS will run unmodified.  They are installed on all nodes with a single install command.  After this, the data partitioning must be declared, which is a one time operation to be done cluster by cluster.  The only application change is specifying the partitioning columns for each index.  The gain is optional redundant storage and capacity not limited to a single machine.  The penalty is that single operations may take a little longer when not all data is managed by the same process but then the parallel throughput is increased.  We note that the main ODS performance factor is web page logic and not database access.  Thus splitting the web server logic over multiple nodes gives basically linear scaling.

Parallel Query Execution

Message latency is the principal performance factor in a clustered database.  Due to this, Virtuoso packs the maximum number of operations in a single message.  For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the join of the inner loop are run in parallel.  So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency.  If each step of the join gets 10 rows, for a total of 100000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.

Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node.

Parallel SQL

There are SQL extensions for stored procedures allowing parallelizing operations.  For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned.  Transactional semantics are kept but error detection is deferred to the actual execution.

Transactions

Each transaction is owned by one node of the cluster, the node to which the client is connected.  When more than one node besides the owner of the transaction is updated, two phase commit is used.  This is transparent to the application code.  No external transaction monitor is required, the Virtuoso instances perform these functions internally.  There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting information.

Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common.

Interconnect and Threading

Virtuoso uses TCP to connect between instances.  A single instance can have multiple listeners at different network interfaces for cluster activity.  The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface.  Long messages, such as transfers of blobs are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding.

We will have to test the performance of TCP over Infiniband to see if there is clear gain in going to a lower level interface like MPI.  The Virtuoso architecture is based on streams connecting cluster nodes point to point.  The design does not per se gain from remote DMA or other features provided by MPI.  Typically, messages are quite short, under 100K.  Flow control for transfer of blobs is however nice to have but can be written at the application level if needed.  We will get real data on the performance of different interconnects in the next weeks.

Deployment and Management

Configuring is quite simple, with each process sharing a copy of the same configuration file.  One line in the file differs from host to host, telling it which one it is.  Otherwise the database configuration files are individual per host, accommodating different file system layouts etc.  Setting up a node requires copying the executable and two configuration files, no more.   All functionality is contained in a single process.  There are no installers to be run or such.

Changing the number or network interface of cluster nodes requires a cluster restart.  Changing data partitioning requires copying the data into a new table and renaming this over the old one.  This is time consuming and does not mix well with updates.  Splitting an existing cluster node requires no copying with repartitioning but shifting data between partitions does.

A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes.

Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.

Present State and Next Developments

The basics are now in place.  Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc.  Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written.

The RDF capabilities are not specifically affected by clustering except in a couple of places.  Loading will be slightly revised to use larger batches of rows to minimize latency, for example.

There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins etc.  These will be added gradually.  These are however not really central to the first application of RDF storage but are quite important for business intelligence, for example.

We will run some benchmarks for comparing single host and clustered Virtuoso instances over the next weeks.  Some of this will be with real data, giving an estimate on when we can move some of the RDF data we presently host to the new platform.  We will benchmark against Oracle and DB2 later but first we get things to work and compare against ourselves.

We roughly expect a halving in space consumption and a significant increase in single query performance and linearly scaling parallel throughput through addition of cluster nodes.

The next update will be on this blog within two weeks.

# PermaLink Comments [0]
08/27/2007 05:51 GMT Modified: 04/25/2008 11:59 GMT
Virtuoso Cluster Preview [ Orri Erling ]

I wrote the basics of the Virtuoso clustering support over the past three weeks.  It can now manage connections, decide where things go, do two phase commits, insert and select data from tables partitioned over multiple Virtuoso instances.  It works about enough to be measured, of which I will blog more over the next two weeks.

I will in the following give a features preview of what will be in the Virtuoso clustering support when it is released in the fall of this year (2007).

Data Partitioning

A Virtuoso database consists of indices only, so that the row of a table is stored together with the primary key.  Blobs are stored on separate pages when they do not fit inline within the row.  With clustering, partitioning can be specified index by index. Partitioning means that values of specific columns are used for determining where the containing index entry will be stored.  Virtuoso partitions by hash and allows specifying what parts of partitioning columns are used for the hash, for example bits 14-6 of an integer or the first 5 characters of a string.  Like this, key compression gains are not lost by storing consecutive values on different partitions.

Once the partitioning is specified, we specify which set of cluster nodes stores this index.  Not every index has to be split evenly across all nodes.  Also, all nodes do not have to have equal slices of the partitioned index, accommodating differences in capacity between cluster nodes.

Each Virtuoso instance can manage up to 32TB of data.  A cluster has no definite size limit.

Load Balancing and Fault Tolerance

When data is partitioned, an operation on the data goes where the data is.  This provides a certain natural parallelism but we will discuss this further below.

Some data may be stored multiple times in the cluster, either for fail-over or for splitting read load.  Some data, such as database schema, is replicated on all nodes.  When specifying a set of nodes for storing the partitions of a key, it is possible to specify multiple nodes for the same partition.  If this is the case, updates go to all nodes and reads go to a randomly picked node from the group.

If one of the nodes in the group fails, operation can resume with the surviving node.  The failed node can be brought back online from the transaction logs of the surviving nodes. A few transactions may be rolled back at the time of failure and again at the time of the failed node rejoining the cluster but these are aborts as in the case of deadlock and lose no committed data.

Shared Nothing

The Virtuoso architecture does not require a SAN for disk sharing across nodes.  This is reasonable since a few disks on a local controller can easily provide 300MB/s of read and passing this over an interconnect fabric that would also have to carry inter-node messages could saturate even a fast network.

Client View

A SQL or HTTP client can connect to any node of the cluster and get an identical view of all data with full transactional semantics.  DDL operations like table creation and package installation are limited to one node, though.

Applications such as ODS will run unmodified.  They are installed on all nodes with a single install command.  After this, the data partitioning must be declared, which is a one time operation to be done cluster by cluster.  The only application change is specifying the partitioning columns for each index.  The gain is optional redundant storage and capacity not limited to a single machine.  The penalty is that single operations may take a little longer when not all data is managed by the same process but then the parallel throughput is increased.  We note that the main ODS performance factor is web page logic and not database access.  Thus splitting the web server logic over multiple nodes gives basically linear scaling.

Parallel Query Execution

Message latency is the principal performance factor in a clustered database.  Due to this, Virtuoso packs the maximum number of operations in a single message.  For example, when doing a loop join that reads one table sequentially and retrieves a row of another table for each row of the outer table, a large number of the join of the inner loop are run in parallel.  So, if there is a join of five tables that gets one row from each table and all rows are on different nodes, the time will be spent on message latency.  If each step of the join gets 10 rows, for a total of 100000 results, the message latency is not a significant factor and the cluster will clearly outperform a single node.

Also, if the workload consists of large numbers of concurrent short updates or queries, the message latencies will even out and throughput will scale up even if doing a single transaction were faster on a single node.

Parallel SQL

There are SQL extensions for stored procedures allowing parallelizing operations.  For example, if a procedure has a loop doing inserts, the inserted rows can be buffered until a sufficient number is available, at which point they are sent in batches to the nodes concerned.  Transactional semantics are kept but error detection is deferred to the actual execution.

Transactions

Each transaction is owned by one node of the cluster, the node to which the client is connected.  When more than one node besides the owner of the transaction is updated, two phase commit is used.  This is transparent to the application code.  No external transaction monitor is required, the Virtuoso instances perform these functions internally.  There is a distributed deadlock detection scheme based on the nodes periodically sharing transaction waiting information.

Since read transactions can operate without locks, reading the last committed state of uncommitted updated rows, waiting for locks is not very common.

Interconnect and Threading

Virtuoso uses TCP to connect between instances.  A single instance can have multiple listeners at different network interfaces for cluster activity.  The interfaces will be used in a round-robin fashion by the peers, spreading the load over all network interfaces. A separate thread is created for monitoring each interface.  Long messages, such as transfers of blobs are done on a separate thread, thus allowing normal service on the cluster node while the transfer is proceeding.

We will have to test the performance of TCP over Infiniband to see if there is clear gain in going to a lower level interface like MPI.  The Virtuoso architecture is based on streams connecting cluster nodes point to point.  The design does not per se gain from remote DMA or other features provided by MPI.  Typically, messages are quite short, under 100K.  Flow control for transfer of blobs is however nice to have but can be written at the application level if needed.  We will get real data on the performance of different interconnects in the next weeks.

Deployment and Management

Configuring is quite simple, with each process sharing a copy of the same configuration file.  One line in the file differs from host to host, telling it which one it is.  Otherwise the database configuration files are individual per host, accommodating different file system layouts etc.  Setting up a node requires copying the executable and two configuration files, no more.   All functionality is contained in a single process.  There are no installers to be run or such.

Changing the number or network interface of cluster nodes requires a cluster restart.  Changing data partitioning requires copying the data into a new table and renaming this over the old one.  This is time consuming and does not mix well with updates.  Splitting an existing cluster node requires no copying with repartitioning but shifting data between partitions does.

A consolidated status report shows the general state and level of intra-cluster traffic as count of messages and count of bytes.

Start, shutdown, backup, and package installation commands can only be issued from a single master node. Otherwise all is symmetrical.

Present State and Next Developments

The basics are now in place.  Some code remains to be written for such things as distributed deadlock detection, 2-phase commit recovery cycle, management functions, etc.  Some SQL operations like text index, statistics sampling, and index intersection need special support, yet to be written.

The RDF capabilities are not specifically affected by clustering except in a couple of places.  Loading will be slightly revised to use larger batches of rows to minimize latency, for example.

There is a pretty much infinite world of SQL optimizations for splitting aggregates, taking advantage of co-located joins etc.  These will be added gradually.  These are however not really central to the first application of RDF storage but are quite important for business intelligence, for example.

We will run some benchmarks for comparing single host and clustered Virtuoso instances over the next weeks.  Some of this will be with real data, giving an estimate on when we can move some of the RDF data we presently host to the new platform.  We will benchmark against Oracle and DB2 later but first we get things to work and compare against ourselves.

We roughly expect a halving in space consumption and a significant increase in single query performance and linearly scaling parallel throughput through addition of cluster nodes.

The next update will be on this blog within two weeks.

# PermaLink Comments [0]
08/27/2007 09:44 GMT Modified: 04/25/2008 11:59 GMT
 <<     | 1 | 2 |     >>
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform