A while back, a friend suggested he and I go check out the Singularity
Summit, a conference where they talk of strong AI. Well, I am not a
singularist. But since singularists are so brazenly provocative, I
looked a little to see if there is anything there, engineering-wise.
So, for a change, we'll be visionary. Read on; I also mention RDF at the end.
I will not even begin with arguments about indefinitely continuing a
trend. I will just say that nature has continuous and discontinuous
features. Computing is at a qualitative bend and things are not as
simple as they are blithely cracked up to be. Read HEC, the High-End
Crusader; he has good commentary on architecture. Having absorbed and
understood this, you can talk again about a billion-fold increase in
computing price/performance.
When I looked further, about uploading, i.e., the sci-fi process of
scanning a brain and having it continue its function inside a
computer simulation, I actually found some serious work. Dharmendra
Modha at IBM had made a rat-scale cortical simulator running on a 32K
node IBM BlueGene, with a 9x slowdown from real time. This is real
stuff and in no way meant to be associated with singularism by being
mentioned in the same paragraph. Anyway, I was intrigued.
So I asked myself if I could do better. I gave it a fair try, as fair
as can be without an actual experiment. The end result is that
latency rules. To have deterministic semantics, one must sync across
tens of thousands of processors and one cannot entirely eliminate
multi-hop messages on the interconnect fabric. If one optimistically
precalculates and spreads the speculative results over the cluster,
rolling them back will be expensive. Local optimistic computation may
have little point, since most data points will directly depend on non-local
data. One needs two sets of wiring, a 3D torus for predominantly
close range bulk and a tree for sync, just like the BlueGene has.
Making the same from newer hardware makes it more economical but
there's another 8 or so orders of magnitude to go before energy
efficiency parity with biology. Anyway, a many-times-over
qualitative gap. Human scale in real time might be just there
somewhere in reach with the stuff now in pre-release if there were no
bounds on cost or power consumption. We'd be talking billions at
least and even then it is iffy. But this is of little relevance as
long as the rat scale is at the level of a systems test for the
platform and not actually simulating a plausible organism in some sort
of virtual environment. Anyway, of biology I cannot judge, and as for the
computer science, Modha et al have it figured out about as well as can be.
Simulation workloads are not the same as database workloads. Database is
easier in a way, since global sync is very rare. 2PC seldom touches every
partition and if it does, the time to update was anyway greater than
the time to commit. Databases are generally multi-user, with a low
level of sync between users and a lot of asynchrony. On the
other hand, the general case of database does not have predictable
cluster locality. Well, if one has an OLTP app with a controlled set
of transactions and reports, then one can partition just for that and
have almost 100% affinity between the host serving the connection and
the data. With RDF for example, such is not generally possible or
would require such acrobatics of data layout that a DBA would not even
begin.
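To make the locality contrast concrete, here is a small sketch of why RDF resists the kind of partition affinity an OLTP schema allows. Triples are hashed by subject (a common choice; the partition count and data are illustrative): a subject-bound lookup touches one partition, while a join that steps through intermediate bindings lands on essentially arbitrary partitions.

```python
# Sketch: partition affinity for a subject-bound lookup vs. a join.
# Hashing and data are illustrative, not any particular system's layout.

N_PARTITIONS = 4

def partition_of(term):
    """Hash a term to a partition; stands in for the cluster's hash map."""
    return hash(term) % N_PARTITIONS

triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob",   "ex:knows", "ex:carol"),
    ("ex:carol", "ex:knows", "ex:dave"),
]

# A subject-bound pattern has perfect affinity: exactly one partition.
lookup_partitions = {partition_of("ex:alice")}

# A two-hop join (?x knows ?y . ?y knows ?z) must visit the partition of
# each intermediate binding, which is essentially random.
join_partitions = {partition_of(o) for (_, _, o) in triples}

print(lookup_partitions, join_partitions)
```

The point is not the hash function but the shape of the access pattern: the join's partition set grows with the data, so no static layout gives the near-100% affinity of a controlled OLTP transaction mix.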
So, for database, there is a pretty much even probability for any
connection between the node running the query and any other. True,
function shipping can make the messages large and fairly asynchronous
and latency-tolerant.
On the simulation side, it would seem that the wiring can mimic
locality of the simulated process. Messages to neighbors are more
likely than messages to remote nodes. So a 3D torus works well there,
complemented by a tree for control messages, like sending all nodes
the count of messages to expect. Of course, the control wiring
(reduce tree) must have far less latency than a long path through the
torus wiring and steps with long message deliveries on the torus must
be rare for this to bring any profit. Also, long distance messages
can go through the tree wiring if the volume is not excessive, else
the tree top gets congested.
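The control-message scheme can be sketched in a few lines: before a step, each node announces how many payload messages it will send to each peer; the reduce tree sums these into per-receiver expectations, and a receiver knows the step is complete when its count is met. This is simulated in one process, with illustrative names.

```python
# Sketch of sync via message counts: a reduce over per-node send plans
# tells each receiver how many messages to wait for in this step.

def reduce_expected_counts(send_plans):
    """send_plans: {sender: {receiver: n_messages}} -> {receiver: total}."""
    expected = {}
    for plan in send_plans.values():
        for receiver, n in plan.items():
            expected[receiver] = expected.get(receiver, 0) + n
    return expected

send_plans = {
    "node0": {"node1": 2, "node2": 1},
    "node1": {"node2": 3},
    "node2": {"node0": 1},
}

expected = reduce_expected_counts(send_plans)

def step_done(receiver, n_received):
    """A receiver is done when it has seen all the messages announced for it."""
    return n_received >= expected.get(receiver, 0)

print(expected)
```

On real hardware the summation would be the reduce tree's job, which is exactly why its latency must be far below a long torus path: every step pays for it.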
So, looking past the multicore/multithread and a single level switched
interconnect, what would the architecture of the total knowledge
machine be? For the neural simulation, the above-described is the
best I can come up with and IBM already has come up with it anyway.
But here I am more concerned with a database/symbol processing
workload than about physical simulation. For the record, I'll say
that I expect no strong AI to emerge from these pursuits but that they
will still be useful, basically as a support for a linked data
"planetary datasphere." Like Google, except it supports arbitrary
queries joining arbitrary data at web scale. This would be a couple
of orders of magnitude above the text index of the same data. Now, in
practice, most queries will be rather trivial, so it is not that the 2
orders of magnitude are always realized. This would also involve a
bit more updating than the equivalent text index since reports would
now and then include private materialized inference results. As a
general rule, backward chaining would be nicer, since it is read-only,
but some private workspaces for distilling data cannot be avoided.
So, given this general spec, where should architecture go? We can talk of processors, networks and software in turn.
Processors
For evaluating processors, the archetypal task is doing a single
random lookup out of a big index. Whether it is a B tree or hash is not
really essential since both will have to be built of fixed size pages
and will have more than one level. Both need some serialization for
checking whether pages are in memory and for pinning them in cache for
the duration of processing. The pinning is eternally needed as long as
updates are permitted, even if RAM were the bottom of the memory
hierarchy; and if it is not, there is additionally the check of
whether a page is in memory at all.
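The check-and-pin step can be sketched as a tiny buffer pool: the lookup must take a lock, check residency, and bump a pin count so a concurrent writer cannot evict or replace the page mid-read. This is an in-process illustration with made-up names, not any engine's actual buffer manager.

```python
# Minimal buffer-pool sketch: the small critical section with small
# memory writes that every index lookup pays, as described above.

import threading

class BufferPool:
    def __init__(self):
        self.pages = {}            # page_id -> page contents
        self.pin_counts = {}       # page_id -> number of current users
        self.lock = threading.Lock()

    def pin(self, page_id, read_from_disk):
        """Ensure the page is resident and pinned; return its contents."""
        with self.lock:            # serialization on the shared structure
            if page_id not in self.pages:
                self.pages[page_id] = read_from_disk(page_id)
            self.pin_counts[page_id] = self.pin_counts.get(page_id, 0) + 1
            return self.pages[page_id]

    def unpin(self, page_id):
        with self.lock:
            self.pin_counts[page_id] -= 1

pool = BufferPool()
page = pool.pin(7, lambda pid: b"page-%d" % pid)
# ... traverse the index level held in `page` ...
pool.unpin(7)
print(page)
```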
This means critical sections on shared data structures with small memory writes inside. This is
anathema, yet true. So, what the processor needs is shared memory for
threads and if possible an instruction for entry into a read-write
lock. If the lock is busy, there should be a settable interval before
the thread is removed from the core. With a multithread core, this
should be just like a memory cache miss. Only if this were really
prolonged would there be an interrupt to run the OS scheduler to
vacate the thread and load something else, and not even this if the
number of runnable threads were less than or equal to the number of
threads on the cores.
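The "settable interval before the thread is removed from the core" can be sketched as spin-then-block: retry the lock cheaply for a bounded interval, and only after that fall back to a blocking acquire that involves the scheduler. A pure-Python stand-in for what would really be a hardware or OS primitive; the interval value is illustrative.

```python
# Sketch of a bounded spin before descheduling on a busy lock.

import threading, time

def spin_then_block_acquire(lock, spin_interval=0.0001):
    """Spin for roughly spin_interval seconds, then block in the OS."""
    deadline = time.monotonic() + spin_interval
    while time.monotonic() < deadline:
        if lock.acquire(blocking=False):   # cheap retry, thread stays on core
            return "spin"
    lock.acquire()                         # give up the core to the scheduler
    return "blocked"

lock = threading.Lock()
how = spin_then_block_acquire(lock)        # uncontended: wins while spinning
lock.release()
print(how)
```

The design point is exactly the one in the text: for critical sections a few hundred nanoseconds long, the blocking path should be the rare exception, because entering the OS scheduler costs orders of magnitude more than the protected work.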
For thread sync structures, the transactional memory ideas in the SPARC Rock
might be going in this direction. For on-core threading, I have not
tested how well the SPARC T2 does, but with the right OS support it
might be in the ballpark. The x86_64 chips have nice thread speed, but
the OS, whether Solaris or Linux, is a disaster if a mutex is busy. I
don't know why, but so it is.
Things like IBM Cell, with multiple integrated distributed memory
processors on a chip might be workable if they had hardware for global
critical sections and if sub-microsecond tasks could be profitably
dispatched to the specialized cores (synergistic processing units, in
Cell terminology). If an index lookup, about 2-4 microseconds of real
time, could be profitably carried out on a synergistic core without
sinking it all in sync delays on the main core, there might be some
gain to this. Still, operations like cache lookups tend to be pointer
chasing and latency of small memory reads and sometimes writes is a
big factor. I have not measured Cell for this but it is not
advertised for this sort of workload. It might do nicely for the
neuron simulator but not for generic database, I would
guess.
Networks and Data Center
The point at which distributed memory wins over shared determines the
size of the single compute node. Problems of threads and critical
sections fall off but are replaced by the network and the troubles of
scheduling and of serializing and deserializing messages. There are
really huge shared memory systems (by Cray, for example) but even there,
hitting the wrong address sends a high-latency message over an
internal switched fabric, a bit like a network page fault. Well, if
one has millions of runnable threads to hide the latency of memory
access over the interconnect, this might not be so bad, but it is not
really the architecture for a general-purpose database. So, at some point, we
have clearly demarcated distributed memory with affinity of data to processor, simple as that.
For now, the high end clustered database benchmarks run off 1 Gbit
ethernet fabrics with some 30 compute nodes on a single switch. This
is for shared nothing systems. For shared disk, cache fusion systems
like Oracle RAC, we have more heavy duty networking like Infiniband,
as one would expect. I have discussed the merits of cache fusion vs
shared nothing partitioning in a previous post.
As long as we are at a scale that fits on a single switch with even
port to port latency, we are set. For an RDF workload, throughput is
not really the issue but latency can be. With today's technology,
nodes with 4-8 cores and 16G RAM are practical and their number is
most often not really large. Adding two orders of magnitude, we get
more questions. Let's say that 2 billion triples fit with relative
comfort but not without disk on a 16G RAM node. This would make 500 nodes for a trillion triples.
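The back-of-envelope behind the node count, taking the per-node figure of 2 billion triples on 16G RAM as the working assumption:

```python
# Capacity arithmetic for the cluster sizing in the text.

triples_per_node = 2_000_000_000       # assumed comfortable fit per 16G node
target = 1_000_000_000_000             # a trillion triples

nodes = target // triples_per_node
print(nodes)                           # 500
```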
This is an entirely relevant and reasonable scale, considering that
loading all public biomedical sets plus a pharma company's in house
data could approach this ballpark. Let alone anything on the scale of
the social web, i.e., the online conversation space.
So how to manage clusters like this? The current cluster gear is
oriented toward switch trees and is fairly expensive, about $1000 per
node. To make this easy, we would need a wiring-free, modular
architecture. Picture a shelf with SATA-drive-size bays, each holding
a compute node of 8 cores and 16G plus interconnect. To simplify the
network, the module would have a laser on each side for a cube network
topology; the enclosure could connect the edges with fiber optics for
the 3D torus. Mass storage would come in the same form factor, as
disks or flash, to be interspersed in the proper ratio with compute
nodes. All would have the same communication chip,
something like a Cray SeaStar with 6 or 7 external ports. A seventh
port could be used for a reduce tree. The rack would provide cooling
on the shelves by circulating a coolant fluid. Reconfiguring and
scaling would be a matter of adding shelves to the cabinet and laying
out Lego bricks, blue for compute and red for storage. The network
could achieve a latency for an arbitrary point to point message of no more than 10-20
microseconds by the right mix of cube and tree. This in a box of
thousands of nodes.
This type of layout would accommodate web scale without needing rows and rows of rack cabinets.
The external interfaces for web requests and replies are no big deal
after this, since the intra-cluster traffic is orders of magnitude
higher than the result set traffic. After all, the end user does not
want dumps of the database but highly refined and ranked results.
Software
We have previously talked about dynamic partitions a la Google
Bigtable or Amazon Dynamo. These techniques are entirely fine and
will serve for the universal knowledge store.
But what about query logic? OK, having a consistent map of partitions
shared over tens of thousands of nodes is entirely possible. So is
replication and logic for using a spatially closer partition when
multiple copies are known to exist. These things take some
programming, but there is nothing really new about them. They are a
straightforward extension of what we do for clustering right now.
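Replica selection against a shared partition map can be sketched in a few lines: each partition lists its replica nodes, and a reader prefers the replica nearest to it by some distance measure. The map, the coordinates, and the Manhattan-hop metric here are all illustrative stand-ins for whatever the fabric actually provides.

```python
# Sketch: pick the spatially closest replica from a shared partition map.

partition_map = {
    0: ["node-a", "node-b"],           # partition -> replica nodes
    1: ["node-b", "node-c"],
}

coords = {"node-a": (0, 0), "node-b": (1, 0), "node-c": (3, 2)}

def hops(n1, n2):
    """Manhattan distance on a made-up grid, a proxy for torus hops."""
    (x1, y1), (x2, y2) = coords[n1], coords[n2]
    return abs(x1 - x2) + abs(y1 - y2)

def closest_replica(partition, from_node):
    return min(partition_map[partition], key=lambda n: hops(n, from_node))

print(closest_replica(1, "node-a"))    # node-b is 1 hop away, node-c is 5
```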
But look to windward. How to run complex queries and inference on the
platform outlined above? There are some features of RDF querying like
same-as that can be easily parallelized, backward-chaining style: Just
proceed with the value at hand and initiate the lookup of synonyms and
let them get processed when they become available. Same for subclass
and sub-property. We already do this, but could do it with more
parallelism.
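The same-as expansion can be sketched as a work queue: start with the value at hand, enqueue its synonyms, and process each as it becomes available, which is exactly the shape that parallelizes, since each pending item could be an in-flight message. Single-process simulation, with an illustrative synonym table.

```python
# Backward-chaining-style same-as expansion as a work queue.

from collections import deque

same_as = {                            # illustrative synonym edges
    "ex:aspirin": ["ex:acetylsalicylic_acid"],
    "ex:acetylsalicylic_acid": ["ex:ASA"],
}

def expand_synonyms(start):
    """Return start plus everything reachable over same-as."""
    seen = {start}
    queue = deque([start])
    while queue:
        term = queue.popleft()         # process a binding as it arrives
        for syn in same_as.get(term, []):
            if syn not in seen:
                seen.add(syn)
                queue.append(syn)      # initiate lookup of further synonyms
    return seen

print(expand_synonyms("ex:aspirin"))
```

Subclass and sub-property closure have the same structure, only over a different edge table.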
No matter what advances in architecture take place, I do not see a
world where every user materializes the entailment of their own
same-as-es and rules over a web scale data set. So, backward chaining
approaches to inference must develop. Luckily, what most queries need
is simple. A rule-oriented language, like Prolog without cut will
parallelize well enough. Some degree of memoization may be
appropriate for cutting down on re-proving the same thing over and
over. Memoization over a cluster is a problem though, since this
involves messages. I should say that one should not go looking for
pre-proven things beyond the node at hand and that computation should
not spread too quickly or too promiscuously so as not to make long
message paths. We must remember that a totally even round trip time
on a large cluster just will not happen.
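Node-local memoization is simple to state in code: consult only the cache on the node at hand before re-proving, and never send a message just to ask whether somebody else proved the goal already. An illustrative sketch; the prover is a stand-in for a real derivation.

```python
# Sketch: memoize proven goals locally, with no remote cache lookups.

local_memo = {}                        # goal -> result, private to this node

def prove(goal, prover):
    """Check only the local cache before re-proving the goal."""
    if goal in local_memo:
        return local_memo[goal]
    result = prover(goal)              # the potentially expensive derivation
    local_memo[goal] = result
    return result

calls = []
def counting_prover(goal):
    calls.append(goal)                 # record how often we actually prove
    return goal.upper()

prove("subclass(a,b)", counting_prover)
prove("subclass(a,b)", counting_prover)   # second call hits the cache
print(len(calls))                         # proved once, not twice
```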
Query planning on any system critically depends on correct ballpark
guesses on cardinality. If predicates are amplified with transitivity
and same-as at run time, the cardinality guessing becomes harder.
This can kill any plan no matter the hardware. Probably, an executing
query must be annotated with the underlying cardinality assumptions.
If these prove radically false, the execution may abort and a new plan
be made to better match what was found out. This bears some looking into.
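The annotate-and-abort idea can be sketched as follows: each plan step carries its assumed cardinality, the executor compares actuals against it, and execution aborts for a replan when reality is off by more than some factor. The step representation and the 10x threshold are illustrative choices, not a real planner's.

```python
# Sketch: execution guarded by the plan's cardinality assumptions.

class CardinalityViolation(Exception):
    pass

def run_plan(steps, max_ratio=10.0):
    """steps: list of (assumed_rows, executor returning actual rows)."""
    for assumed, execute in steps:
        actual = execute()
        if actual > assumed * max_ratio:
            raise CardinalityViolation(
                f"assumed {assumed}, saw {actual}: replan")
    return "done"

# The first plan assumes a selective step; same-as expansion blows it up.
try:
    run_plan([(100, lambda: 50), (10, lambda: 5000)])
    outcome = "done"
except CardinalityViolation:
    # Replan with the observed cardinality and run again.
    outcome = run_plan([(100, lambda: 50), (10000, lambda: 5000)])

print(outcome)                         # done
```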
There are some network algorithms like shortest path, traveling
salesman and similar that probably deserve a special operator in the
query language. These can benefit from parallelization and a
sequential implementation running on a cluster with latency will be a
disaster. Expressing the message flow in a rule language is not
really simple, and pretty much no programmer will either appreciate the
necessity or go to the trouble. Therefore such things should likely be
offered by the platform, built by the few people who understand such matters.
For forward chaining, it seems that any results should generally go
into their own graph so as not to pollute the base data. This graph,
supposing it is small enough, can have different partitioning from
the large base data set. If the data comes from far and wide but the
results are local, a RETE-like algorithm for triggering inference as
data comes in is the better fit. RETE will parallelize well enough,
also on clusters; results just have to be broadcast to the nodes that
may have a use for them.
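A very small sketch of the RETE-style idea: keep partial matches per rule, feed facts in as they arrive, and put derived results into their own graph rather than the base data. One two-condition rule in one process; all names are illustrative, and the broadcast step is reduced to a comment.

```python
# Incremental, RETE-like triggering: fire a rule as its conditions fill in.

derived_graph = set()      # results go to their own graph, not the base data
partial = {}               # ?x -> set of conditions seen so far for ?x

def assert_fact(fact):
    """Rule: knows(?x) AND expert(?x) => notify(?x), fired incrementally."""
    pred, x = fact
    if pred in ("knows", "expert"):
        partial.setdefault(x, set()).add(pred)
    if partial.get(x) == {"knows", "expert"}:
        derived_graph.add(("notify", x))   # would be broadcast to interested nodes

for fact in [("knows", "bob"), ("expert", "alice"), ("expert", "bob")]:
    assert_fact(fact)

print(derived_graph)       # only bob satisfies both conditions
```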
The programming model will typically use a set of local overlays on a
large shared data set. Queries will most often not be against a single
graph. Strict transactionality will be the exception rather than the
rule. At the database node level, though, there must be real
transactionality even if the whole system most often does not run with
strict ACID semantics. This is due to the ACID requirements of some
internal operations, e.g., some bitmap index operations, log checkpoints, DDL changes, etc.
For procedural tasks, map-reduce is OK. We have it even
in our SQL. Map-reduce is not the basis for making a
DBMS but it is a nice feature for some parts of query evaluation and application logic.
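Map-reduce as a feature inside query evaluation, rather than the basis of the DBMS, can be illustrated with a tiny in-process version: a grouped aggregation expressed as a map phase emitting key-value pairs and a reduce phase per key. The function names and the example data are mine, for illustration only.

```python
# Minimal in-process map-reduce over a list of records.

from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                 # map: emit (key, value) pairs
        for key, value in map_fn(record):
            groups[key].append(value)
    return {k: reduce_fn(vs) for k, vs in groups.items()}   # reduce per key

# Example: count triples per predicate, a natural reporting step.
triples = [("s1", "p1", "o1"), ("s2", "p1", "o2"), ("s3", "p2", "o3")]
counts = map_reduce(triples, lambda t: [(t[1], 1)], sum)
print(counts)                              # {'p1': 2, 'p2': 1}
```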
We have not talked about linking the data itself, but there is a whole workshop on this next week in Beijing; I will write about it separately.
Let this just serve to state that we are serious about the platform for this, present and future.
Conclusion
The web scale database, the Google with arbitrary joining and
inference, is one generation away, talking of economical
implementation. Today, a price tag of $100K will buy some 50-100
billion triples worth with reasonable query response. An unlimited
budget will take this a bit further, maybe one order of magnitude, and
then returns might be diminishing.
Of course, this is worth nothing if the software isn't there.
Virtuoso with its cluster edition will allow one to use unlimited RAM and
processors. The frontier is now at getting just the right parallelism for
each task, including inference ones.