A while back, a friend suggested he and I go check out the Singularity
Summit, a conference where they talk of strong AI. Well, I am not a
singularist. But since singularists are so brazenly provocative, I
looked a little to see if there is anything there, engineering-wise.
So, for a change, we'll be visionary. Read on; I also mention RDF at the end.
I will not even begin with arguments about indefinitely continuing a
trend. I will just say that nature has continuous and discontinuous
features. Computing is at a qualitative bend and things are not as
simple as they are blithely cracked up to be. Read HEC, the High-End
Crusader; he has good commentary on architecture. Having absorbed and
understood this, you can talk again about a billion-fold increase in
computing price/performance.
When I looked further, about uploading, i.e., the sci-fi process of
scanning a brain and having it continue its function inside a
computer simulation, I actually found some serious work. Dharmendra
Modha at IBM had made a rat-scale cortical simulator running on a 32K
node IBM BlueGene, with a 9x slowdown from real time. This is real
stuff and in no way meant to be associated with singularism by being
mentioned in the same paragraph. Anyway, I was intrigued.
So I asked myself if I could do better. I gave it a fair try, as fair
as can be without an actual experiment. The end result is that
latency rules. To have deterministic semantics, one must sync across
tens of thousands of processors and one cannot entirely eliminate
multi-hop messages on the interconnect fabric. If one optimistically
precalculates and spreads the speculative results over the cluster,
rolling them back will be expensive. Local optimistic computation may
have little point, since most data points will directly depend on non-local
data. One needs two sets of wiring, a 3D torus for predominantly
close range bulk and a tree for sync, just like the BlueGene has.
Making the same from newer hardware makes it more economical but
there's another 8 or so orders of magnitude to go before energy
efficiency parity with biology. Anyway, a many-times-over
qualitative gap. Human scale in real time might be just there
somewhere in reach with the stuff now in pre-release if there were no
bounds on cost or power consumption. We'd be talking billions at
least and even then it is iffy. But this is of little relevance as
long as the rat scale is at the level of a systems test for the
platform and not actually simulating a plausible organism in some sort
of virtual environment. Anyway, of biology I cannot judge, and as for the
computer science, Modha et al have it figured out about as well as can be.
Simulation workloads are not the same as database workloads. Database is
easier in a way, since global sync is very rare. 2PC seldom touches every
partition and if it does, the time to update was anyway greater than
the time to commit. Databases are generally multi-user, with a low
level of sync between users and a lot of asynchrony. On the
other hand, the general case of database does not have predictable
cluster locality. Well, if one has an OLTP app with a controlled set
of transactions and reports, then one can partition just for that and
have almost 100% affinity between the host serving the connection and
the data. With RDF for example, such is not generally possible or
would require such acrobatics of data layout that a DBA would not even
begin.
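To make the locality contrast concrete, here is a small sketch of why RDF resists the kind of partition affinity an OLTP schema allows. Triples are hashed by subject (a common choice; the partition count and data are illustrative): a subject-bound lookup touches one partition, while a join that steps through intermediate bindings lands on essentially arbitrary partitions.

```python
# Sketch: partition affinity for a subject-bound lookup vs. a join.
# Hashing and data are illustrative, not any particular system's layout.

N_PARTITIONS = 4

def partition_of(term):
    """Hash a term to a partition; stands in for the cluster's hash map."""
    return hash(term) % N_PARTITIONS

triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob",   "ex:knows", "ex:carol"),
    ("ex:carol", "ex:knows", "ex:dave"),
]

# A subject-bound pattern has perfect affinity: exactly one partition.
lookup_partitions = {partition_of("ex:alice")}

# A two-hop join (?x knows ?y . ?y knows ?z) must visit the partition of
# each intermediate binding, which is essentially random.
join_partitions = {partition_of(o) for (_, _, o) in triples}

print(lookup_partitions, join_partitions)
```

The point is not the hash function but the shape of the access pattern: the join's partition set grows with the data, so no static layout gives the near-100% affinity of a controlled OLTP transaction mix.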
So, for database, there is a pretty much even probability for any
connection between the node running the query and any other. True,
function shipping can make the messages large and fairly asynchronous
and latency-tolerant.
On the simulation side, it would seem that the wiring can mimic
locality of the simulated process. Messages to neighbors are more
likely than messages to remote nodes. So a 3D torus works well there,
complemented by a tree for control messages, like sending all nodes
the count of messages to expect. Of course, the control wiring
(reduce tree) must have far less latency than a long path through the
torus wiring and steps with long message deliveries on the torus must
be rare for this to bring any profit. Also, long distance messages
can go through the tree wiring if the volume is not excessive, else
the tree top gets congested.
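The control-message scheme can be sketched in a few lines: before a step, each node announces how many payload messages it will send to each peer; the reduce tree sums these into per-receiver expectations, and a receiver knows the step is complete when its count is met. This is simulated in one process, with illustrative names.

```python
# Sketch of sync via message counts: a reduce over per-node send plans
# tells each receiver how many messages to wait for in this step.

def reduce_expected_counts(send_plans):
    """send_plans: {sender: {receiver: n_messages}} -> {receiver: total}."""
    expected = {}
    for plan in send_plans.values():
        for receiver, n in plan.items():
            expected[receiver] = expected.get(receiver, 0) + n
    return expected

send_plans = {
    "node0": {"node1": 2, "node2": 1},
    "node1": {"node2": 3},
    "node2": {"node0": 1},
}

expected = reduce_expected_counts(send_plans)

def step_done(receiver, n_received):
    """A receiver is done when it has seen all the messages announced for it."""
    return n_received >= expected.get(receiver, 0)

print(expected)
```

On real hardware the summation would be the reduce tree's job, which is exactly why its latency must be far below a long torus path: every step pays for it.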
So, looking past the multicore/multithread and a single level switched
interconnect, what would the architecture of the total knowledge
machine be? For the neural simulation, the above-described is the
best I can come up with and IBM already has come up with it anyway.
But here I am more concerned with a database/symbol processing
workload than about physical simulation. For the record, I'll say
that I expect no strong AI to emerge from these pursuits but that they
will still be useful, basically as a support for a linked data
"planetary datasphere." Like Google, except it supports arbitrary
queries joining arbitrary data at web scale. This would be a couple
of orders of magnitude above the text index of the same data. Now, in
practice, most queries will be rather trivial, so it is not that the 2
orders of magnitude are always realized. This would also involve a
bit more updating than the equivalent text index since reports would
now and then include private materialized inference results. As a
general rule, backward chaining would be nicer, since it is read-only,
but some private workspaces for distilling data cannot be avoided.
So, given this general spec, where should architecture go? We can talk of processors, networks and software in turn.
Processors
For evaluating processors, the archetypal task is doing a single
random lookup out of a big index. Whether it is a B tree or hash is not
really essential since both will have to be built of fixed size pages
and will have more than one level. Both need some serialization for
checking whether pages are in memory and for pinning them in cache for
the duration of processing. The pinning is eternally needed as long as
updates are permitted, even if RAM were the bottom of the memory
hierarchy; and if it is not, there is additionally the check of
whether a page is in memory at all.
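The check-and-pin step can be sketched as a tiny buffer pool: the lookup must take a lock, check residency, and bump a pin count so a concurrent writer cannot evict or replace the page mid-read. This is an in-process illustration with made-up names, not any engine's actual buffer manager.

```python
# Minimal buffer-pool sketch: the small critical section with small
# memory writes that every index lookup pays, as described above.

import threading

class BufferPool:
    def __init__(self):
        self.pages = {}            # page_id -> page contents
        self.pin_counts = {}       # page_id -> number of current users
        self.lock = threading.Lock()

    def pin(self, page_id, read_from_disk):
        """Ensure the page is resident and pinned; return its contents."""
        with self.lock:            # serialization on the shared structure
            if page_id not in self.pages:
                self.pages[page_id] = read_from_disk(page_id)
            self.pin_counts[page_id] = self.pin_counts.get(page_id, 0) + 1
            return self.pages[page_id]

    def unpin(self, page_id):
        with self.lock:
            self.pin_counts[page_id] -= 1

pool = BufferPool()
page = pool.pin(7, lambda pid: b"page-%d" % pid)
# ... traverse the index level held in `page` ...
pool.unpin(7)
print(page)
```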
This means critical sections on shared data structures with small memory writes inside. This is
anathema, yet true. So, what the processor needs is shared memory for
threads and if possible an instruction for entry into a read-write
lock. If the lock is busy, there should be a settable interval before
the thread is removed from the core. With a multithread core, this
should be just like a memory cache miss. Only if this were really
prolonged would there be an interrupt to run the OS scheduler to
vacate the thread and load something else, and not even this if the
number of runnable threads were less than or equal to the number of
threads on the cores.
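The "settable interval before the thread is removed from the core" can be sketched as spin-then-block: retry the lock cheaply for a bounded interval, and only after that fall back to a blocking acquire that involves the scheduler. A pure-Python stand-in for what would really be a hardware or OS primitive; the interval value is illustrative.

```python
# Sketch of a bounded spin before descheduling on a busy lock.

import threading, time

def spin_then_block_acquire(lock, spin_interval=0.0001):
    """Spin for roughly spin_interval seconds, then block in the OS."""
    deadline = time.monotonic() + spin_interval
    while time.monotonic() < deadline:
        if lock.acquire(blocking=False):   # cheap retry, thread stays on core
            return "spin"
    lock.acquire()                         # give up the core to the scheduler
    return "blocked"

lock = threading.Lock()
how = spin_then_block_acquire(lock)        # uncontended: wins while spinning
lock.release()
print(how)
```

The design point is exactly the one in the text: for critical sections a few hundred nanoseconds long, the blocking path should be the rare exception, because entering the OS scheduler costs orders of magnitude more than the protected work.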
For thread sync structures, the transactional memory ideas in the SPARC Rock
might be going in this direction. For on-core threading, I have not
tested how well the SPARC T2 does, but with the right OS support it
might be in the ballpark. The x86_64 chips have nice thread speed, but
the OS, whether Solaris or Linux, is a disaster if a mutex is busy. I
don't know why, but so it is.
Things like IBM Cell, with multiple integrated distributed memory
processors on a chip might be workable if they had hardware for global
critical sections and if sub-microsecond tasks could be profitably
dispatched to the specialized cores (synergistic processing units, in
Cell terminology). If an index lookup, about 2-4 microseconds of real
time, could be profitably carried out on a synergistic core without
sinking it all in sync delays on the main core, there might be some
gain to this. Still, operations like cache lookups tend to be pointer
chasing and latency of small memory reads and sometimes writes is a
big factor. I have not measured Cell for this but it is not
advertised for this sort of workload. It might do nicely for the
neuron simulator but not for generic database, I would
guess.
Networks and Data Center
The point at which distributed memory wins over shared determines the
size of the single compute node. Problems of threads and critical
sections fall off but are replaced by the network and the troubles of
scheduling and of serializing and deserializing messages. There are
really huge shared memory systems (by Cray, for example) but even there,
hitting the wrong address sends a high-latency message over an
internal switched fabric, a bit like a network page fault. Well, if
one has millions of runnable threads to hide the latency of memory
access over the interconnect, this might not be so bad, but it is not
really the architecture for a general-purpose database. So, at some point, we
have clearly demarcated distributed memory with affinity of data to processor, simple as that.
For now, the high end clustered database benchmarks run off 1 Gbit
ethernet fabrics with some 30 compute nodes on a single switch. This
is for shared nothing systems. For shared disk, cache fusion systems
like Oracle RAC, we have more heavy duty networking like Infiniband,
as one would expect. I have discussed the merits of cache fusion vs
shared nothing partitioning in a previous post.
As long as we are at a scale that fits on a single switch with even
port to port latency, we are set. For an RDF workload, throughput is
not really the issue but latency can be. With today's technology,
nodes with 4-8 cores and 16G RAM are practical and their number is
most often not really large. Adding two orders of magnitude, we get
more questions. Let's say that 2 billion triples fit with relative
comfort but not without disk on a 16G RAM node. This would make 500 nodes for a trillion triples.
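The back-of-envelope behind the node count, taking the per-node figure of 2 billion triples on 16G RAM as the working assumption:

```python
# Capacity arithmetic for the cluster sizing in the text.

triples_per_node = 2_000_000_000       # assumed comfortable fit per 16G node
target = 1_000_000_000_000             # a trillion triples

nodes = target // triples_per_node
print(nodes)                           # 500
```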
This is an entirely relevant and reasonable scale, considering that
loading all public biomedical sets plus a pharma company's in house
data could approach this ballpark. Let alone anything on the scale of
the social web, i.e., the online conversation space.
So how to manage clusters like this? The current cluster gear is
oriented toward switch trees and is fairly expensive, about $1000 per
node. To make this easy, we would need a wiring-free, modular
architecture. Picture a shelf with SATA-drive-size bays, each holding
a compute node of 8 cores and 16G plus interconnect. To simplify the
network, the module would have a laser on each side for a cube network
topology; the enclosure could connect the edges with fiber optics for
the 3D torus. Mass storage would come in the same form factor, as
disks or flash, to be interspersed in the proper ratio with compute
nodes. All would have the same communication chip,
something like a Cray SeaStar with 6 or 7 external ports. A seventh
port could be used for a reduce tree. The rack would provide cooling
on the shelves by circulating a coolant fluid. Reconfiguring and
scaling would be a matter of adding shelves to the cabinet and laying
out Lego bricks, blue for compute and red for storage. The network
could achieve a latency for an arbitrary point to point message of no more than 10-20
microseconds by the right mix of cube and tree. This in a box of
thousands of nodes.
This type of layout would accommodate web scale without needing rows and rows of rack cabinets.
The external interfaces for web requests and replies are no big deal
after this, since the intra-cluster traffic is orders of magnitude
higher than the result set traffic. After all, the end user does not
want dumps of the database but highly refined and ranked results.
Software
We have previously talked about dynamic partitions a la Google
Bigtable or Amazon Dynamo. These techniques are entirely fine and
will serve for the universal knowledge store.
But what about query logic? OK, having a consistent map of partitions
shared over tens of thousands of nodes is entirely possible. So is
replication and logic for using a spatially closer partition when
multiple copies are known to exist. These things take some
programming, but there is nothing really new about them. They are a
straightforward extension of what we do for clustering right now.
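Replica selection against a shared partition map can be sketched in a few lines: each partition lists its replica nodes, and a reader prefers the replica nearest to it by some distance measure. The map, the coordinates, and the Manhattan-hop metric here are all illustrative stand-ins for whatever the fabric actually provides.

```python
# Sketch: pick the spatially closest replica from a shared partition map.

partition_map = {
    0: ["node-a", "node-b"],           # partition -> replica nodes
    1: ["node-b", "node-c"],
}

coords = {"node-a": (0, 0), "node-b": (1, 0), "node-c": (3, 2)}

def hops(n1, n2):
    """Manhattan distance on a made-up grid, a proxy for torus hops."""
    (x1, y1), (x2, y2) = coords[n1], coords[n2]
    return abs(x1 - x2) + abs(y1 - y2)

def closest_replica(partition, from_node):
    return min(partition_map[partition], key=lambda n: hops(n, from_node))

print(closest_replica(1, "node-a"))    # node-b is 1 hop away, node-c is 5
```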
But look to windward. How to run complex queries and inference on the
platform outlined above? There are some features of RDF querying like
same-as that can be easily parallelized, backward-chaining style: Just
proceed with the value at hand and initiate the lookup of synonyms and
let them get processed when they become available. Same for subclass
and sub-property. We already do this, but could do it with more
parallelism.
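The same-as expansion can be sketched as a work queue: start with the value at hand, enqueue its synonyms, and process each as it becomes available, which is exactly the shape that parallelizes, since each pending item could be an in-flight message. Single-process simulation, with an illustrative synonym table.

```python
# Backward-chaining-style same-as expansion as a work queue.

from collections import deque

same_as = {                            # illustrative synonym edges
    "ex:aspirin": ["ex:acetylsalicylic_acid"],
    "ex:acetylsalicylic_acid": ["ex:ASA"],
}

def expand_synonyms(start):
    """Return start plus everything reachable over same-as."""
    seen = {start}
    queue = deque([start])
    while queue:
        term = queue.popleft()         # process a binding as it arrives
        for syn in same_as.get(term, []):
            if syn not in seen:
                seen.add(syn)
                queue.append(syn)      # initiate lookup of further synonyms
    return seen

print(expand_synonyms("ex:aspirin"))
```

Subclass and sub-property closure have the same structure, only over a different edge table.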
No matter what advances in architecture take place, I do not see a
world where every user materializes the entailment of their own
same-as-es and rules over a web scale data set. So, backward chaining
approaches to inference must develop. Luckily, what most queries need
is simple. A rule-oriented language, like Prolog without cut will
parallelize well enough. Some degree of memoization may be
appropriate for cutting down on re-proving the same thing over and
over. Memoization over a cluster is a problem though, since this
involves messages. I should say that one should not go looking for
pre-proven things beyond the node at hand and that computation should
not spread too quickly or too promiscuously so as not to make long
message paths. We must remember that a totally even round trip time
on a large cluster just will not happen.
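Node-local memoization is simple to state in code: consult only the cache on the node at hand before re-proving, and never send a message just to ask whether somebody else proved the goal already. An illustrative sketch; the prover is a stand-in for a real derivation.

```python
# Sketch: memoize proven goals locally, with no remote cache lookups.

local_memo = {}                        # goal -> result, private to this node

def prove(goal, prover):
    """Check only the local cache before re-proving the goal."""
    if goal in local_memo:
        return local_memo[goal]
    result = prover(goal)              # the potentially expensive derivation
    local_memo[goal] = result
    return result

calls = []
def counting_prover(goal):
    calls.append(goal)                 # record how often we actually prove
    return goal.upper()

prove("subclass(a,b)", counting_prover)
prove("subclass(a,b)", counting_prover)   # second call hits the cache
print(len(calls))                         # proved once, not twice
```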
Query planning on any system critically depends on correct ballpark
guesses on cardinality. If predicates are amplified with transitivity
and same-as at run time, the cardinality guessing becomes harder.
This can kill any plan no matter the hardware. Probably, an executing
query must be annotated with the underlying cardinality assumptions.
If these prove radically false, the execution may abort and a new plan
be made to better match what was found out. This bears some looking into.
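The annotate-and-abort idea can be sketched as follows: each plan step carries its assumed cardinality, the executor compares actuals against it, and execution aborts for a replan when reality is off by more than some factor. The step representation and the 10x threshold are illustrative choices, not a real planner's.

```python
# Sketch: execution guarded by the plan's cardinality assumptions.

class CardinalityViolation(Exception):
    pass

def run_plan(steps, max_ratio=10.0):
    """steps: list of (assumed_rows, executor returning actual rows)."""
    for assumed, execute in steps:
        actual = execute()
        if actual > assumed * max_ratio:
            raise CardinalityViolation(
                f"assumed {assumed}, saw {actual}: replan")
    return "done"

# The first plan assumes a selective step; same-as expansion blows it up.
try:
    run_plan([(100, lambda: 50), (10, lambda: 5000)])
    outcome = "done"
except CardinalityViolation:
    # Replan with the observed cardinality and run again.
    outcome = run_plan([(100, lambda: 50), (10000, lambda: 5000)])

print(outcome)                         # done
```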
There are some network algorithms like shortest path, traveling
salesman and similar that probably deserve a special operator in the
query language. These can benefit from parallelization and a
sequential implementation running on a cluster with latency will be a
disaster. Expressing the message flow in a rule language is not
really simple, and pretty much no programmer will either appreciate the
necessity or go to the trouble. Therefore such things should likely be
offered by the platform, built by the few people who understand such matters.
For forward chaining, it seems that any results should generally go
into their own graph so as not to pollute the base data. This graph,
supposing it is small enough, can have different partitioning from
the large base data set. If the data comes from far and wide but the
results are local, a RETE-like algorithm for triggering inference as
data comes in is the better fit. RETE will parallelize well enough,
also on clusters; results just have to be broadcast to the nodes that
may have a use for them.
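A very small sketch of the RETE-style idea: keep partial matches per rule, feed facts in as they arrive, and put derived results into their own graph rather than the base data. One two-condition rule in one process; all names are illustrative, and the broadcast step is reduced to a comment.

```python
# Incremental, RETE-like triggering: fire a rule as its conditions fill in.

derived_graph = set()      # results go to their own graph, not the base data
partial = {}               # ?x -> set of conditions seen so far for ?x

def assert_fact(fact):
    """Rule: knows(?x) AND expert(?x) => notify(?x), fired incrementally."""
    pred, x = fact
    if pred in ("knows", "expert"):
        partial.setdefault(x, set()).add(pred)
    if partial.get(x) == {"knows", "expert"}:
        derived_graph.add(("notify", x))   # would be broadcast to interested nodes

for fact in [("knows", "bob"), ("expert", "alice"), ("expert", "bob")]:
    assert_fact(fact)

print(derived_graph)       # only bob satisfies both conditions
```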
The programming model will typically use a set of local overlays on a
large shared data set. Queries will most often not be against a single
graph. Strict transactionality will be the exception rather than the
rule. At the database node level, though, there must be real
transactionality even if the whole system most often does not run with
strict ACID semantics. This is due to the ACID requirements of some
internal operations, e.g., some bitmap index operations, log checkpoints, DDL changes, etc.
For procedural tasks, map-reduce is OK. We have it even
in our SQL. Map-reduce is not the basis for making a
DBMS but it is a nice feature for some parts of query evaluation and application logic.
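Map-reduce as a feature inside query evaluation, rather than the basis of the DBMS, can be illustrated with a tiny in-process version: a grouped aggregation expressed as a map phase emitting key-value pairs and a reduce phase per key. The function names and the example data are mine, for illustration only.

```python
# Minimal in-process map-reduce over a list of records.

from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                 # map: emit (key, value) pairs
        for key, value in map_fn(record):
            groups[key].append(value)
    return {k: reduce_fn(vs) for k, vs in groups.items()}   # reduce per key

# Example: count triples per predicate, a natural reporting step.
triples = [("s1", "p1", "o1"), ("s2", "p1", "o2"), ("s3", "p2", "o3")]
counts = map_reduce(triples, lambda t: [(t[1], 1)], sum)
print(counts)                              # {'p1': 2, 'p2': 1}
```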
We have not talked about linking the data itself, but there is a whole workshop on this next week in Beijing; I will write about it separately.
Let this just serve to state that we are serious about the platform for this, present and future.
Conclusion
The web scale database, the Google with arbitrary joining and
inference, is one generation away, talking of economical
implementation. Today, a price tag of $100K will buy some 50-100
billion triples worth with reasonable query response. An unlimited
budget will take this a bit further, maybe one order of magnitude, and
then returns might be diminishing.
Of course, this is worth nothing if the software isn't there.
Virtuoso with its cluster edition will allow one to use unlimited RAM and
processors. The frontier is now at getting just the right parallelism for
each task, including inference ones.