We now have Virtuoso 6 running most of its test suite in single-process and cluster modes. It is time to finalize how it will be configured and deployed. A bit more on this later.
We would have been done in about half the time if we had not also redone the database's physical layout with key compression. Still, if we get 3x more data into the same memory while using 64-bit ids for everything, the effort is justified. For any size above 2 billion triples, the point where 64-bit ids become necessary anyway, this means 3x less cost.
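To make the sort of arithmetic behind key compression concrete, here is a minimal sketch: sorted 64-bit ids delta-encoded as byte-length varints. This only illustrates the principle; the id values and the varint scheme are made up for the example and are not Virtuoso's actual key layout.

```c
/* Minimal sketch of why key compression can buy a multiple of the raw
 * size: delta + varint encoding of sorted 64-bit ids.  Illustration
 * only, not Virtuoso's on-disk format. */
#include <stdio.h>
#include <stdint.h>

/* Encode v as a base-128 varint into buf, return bytes written. */
static int varint_encode (uint64_t v, unsigned char *buf)
{
  int n = 0;
  do
    {
      buf[n] = v & 0x7f;
      v >>= 7;
      if (v)
        buf[n] |= 0x80;      /* more bytes follow */
      n++;
    }
  while (v);
  return n;
}

int main (void)
{
  /* Ids in an index leaf are sorted, so consecutive deltas are small. */
  uint64_t ids[1000];
  uint64_t prev = 0;
  size_t raw = 0, packed = 0;
  unsigned char buf[10];
  int i;

  for (i = 0; i < 1000; i++)
    ids[i] = 1000000000ULL + i * 7;   /* hypothetical nearly dense ids */

  for (i = 0; i < 1000; i++)
    {
      raw += sizeof (uint64_t);                     /* 8 bytes uncompressed */
      packed += varint_encode (ids[i] - prev, buf); /* delta fits in 1-2 bytes */
      prev = ids[i];
    }
  printf ("raw %zu bytes, delta+varint %zu bytes, ratio %.1fx\n",
          raw, packed, (double) raw / packed);
  return 0;
}
```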
A good amount of the time and effort goes into everything except the core. Of course, we first do the optimizations we find appropriate and measure them; after all, the rest is pointless if these do not perform in the desired ballpark.
For delivering something, the requirements are quite the opposite. For example, when defining a unique index, what do we do when the billionth key turns out not to be unique? And what if one of the processes is killed during the operation? Does it all still come out right when replayed from the roll-forward log? See what I mean? There is no end of such cases.
So, well after we are done with the basic functionality, we have to deal with this sort of thing. Even if we limited ourselves to RDF workloads in the first cut, we would still need to do this, since maintaining it would simply not be possible without some generic DBMS functionality. So we get the full-featured generic clustered RDBMS in the same cut; no splitting the deliverable.
The basic cluster execution model is described here.
There are some further optimizations that we will do at or around the time of the first public cut.
These have to do mostly with execution scheduling. For example, a bitmap intersection join must be done differently than on a single server when there is latency in getting the next chunk of bits. Value sub-queries, derived tables, and existence tests must be started as batches, just like joined tables.
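As a rough illustration of what "started as batches" means, here is a sketch of collecting the outer-row bindings bound for one partition and shipping them in a single message when the batch fills, rather than paying one round trip per binding. The names (batch_t, cluster_send, BATCH_SIZE) are invented for the example and do not reflect Virtuoso's internal API.

```c
/* Sketch of starting sub-queries and joins as batches: one message per
 * partition per chunk of outer rows, not one message per row. */
#include <stdio.h>

#define BATCH_SIZE 100

typedef struct batch_s
{
  long keys[BATCH_SIZE];   /* outer-row bindings destined for one partition */
  int  fill;
} batch_t;

/* Stand-in for sending one message to the partition that owns the keys. */
static void cluster_send (int partition, batch_t *b)
{
  printf ("partition %d: one message, %d lookups\n", partition, b->fill);
  b->fill = 0;
}

/* Queue a binding; only flush when the batch is full. */
static void batch_add (int partition, batch_t *b, long key)
{
  b->keys[b->fill++] = key;
  if (b->fill == BATCH_SIZE)
    cluster_send (partition, b);
}

int main (void)
{
  batch_t b = { { 0 }, 0 };
  long outer_row;
  /* 250 outer rows turn into 3 messages instead of 250. */
  for (outer_row = 0; outer_row < 250; outer_row++)
    batch_add (0, &b, outer_row);
  if (b.fill)
    cluster_send (0, &b);          /* flush the partial last batch */
  return 0;
}
```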
Having too many threads on an index is no good. But having a large batch of random lookups to work with, even when each of them does not get its own thread, opens up some possibilities for IO optimization: when a lookup would block for disk, start the disk read asynchronously, as with read-ahead, and move on to the next index lookup in the batch. This is especially so in cluster situations, where the index lookups naturally come in "pre-vectored" batches. You could say that the loop join is unrolled. This is done anyway for message-latency reasons.
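A minimal sketch of the "start the disk asynchronously and take the next lookup from the batch" idea, assuming a plain file of fixed-size pages. The file name index.db, page_of_key(), and the key list are stand-ins, and posix_fadvise with POSIX_FADV_WILLNEED is used here merely as a simple way to get the kernel reading the pages in the background while the rest of the batch is still being walked; it is not how Virtuoso's buffer manager does it.

```c
/* Two-pass batch lookup: pass 1 kicks off the reads for the whole
 * pre-vectored batch, pass 2 does the actual lookups against pages
 * that are, ideally, already in the buffer cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SIZE 8192

/* Hypothetical mapping from a key to the page that holds it. */
static off_t page_of_key (long key)
{
  return (off_t) (key % 100000) * PAGE_SIZE;
}

int main (void)
{
  long keys[256];
  unsigned char page[PAGE_SIZE];
  int i, fd = open ("index.db", O_RDONLY);   /* stand-in index file */
  if (fd < 0)
    {
      perror ("open");
      return 1;
    }
  for (i = 0; i < 256; i++)
    keys[i] = random ();                     /* a batch of random lookups */

  /* Pass 1: instead of blocking on the first miss, start the reads. */
  for (i = 0; i < 256; i++)
    posix_fadvise (fd, page_of_key (keys[i]), PAGE_SIZE, POSIX_FADV_WILLNEED);

  /* Pass 2: do the lookups; the IO is already in flight or done. */
  for (i = 0; i < 256; i++)
    if (pread (fd, page, PAGE_SIZE, page_of_key (keys[i])) < 0)
      perror ("pread");

  close (fd);
  return 0;
}
```

The same two-pass shape is what the cluster case gives us for free: the batch of keys arrives in one message, so the reads can all be issued before any single lookup is allowed to stall the thread.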
Do we optimize for the right stuff? Well, looking into the future, it does not look like regular RAM will be the bottom of the storage hierarchy, no matter how you look at it. With solid state disks, locality may not be so important, but latency is here to stay. With everything now growing sideways, as in number of cores and core multithreading, we are just looking at deepening our already warm and intimate relationship with the Moira who cuts the thread: Atropos, Lady Latency. The attention of the best minds in the industry is devoted to thee.