Orri Erling

Virtuoso 5.0 Preview

As previously said, we have a Virtuoso with brand new engine multithreading. It is now complete and passes its regular test suite. This is the basis for Virtuoso 5.0, which will be available in open source and commercial cuts as before.

As one benchmark, we used the TPC-C test driver that has always been bundled with Virtuoso. We ran 100000 new orders' worth of the TPC-C transaction mix, first with one client and then with 4 clients, each client going to its own warehouse, so there was not much lock contention. We did this on a 4-core Intel box with the working set in RAM. With the old engine, 1 client took 1m43s and 4 clients took 3m47s. With the new engine, one client took 1m30s and 4 clients took 2m37s. So, 400000 new orders in 2m37s, for 152820 new orders per minute as opposed to 105720 per minute previously. Do not confuse this with the official tpmC metric, which involves a whole bunch of further rules.

TPC-C has activity spread over a few different tables. With tests dealing with fewer tables, improvements in parallelism are far greater.

Aside from better parallelism, we have other new features. One of them is a change in read committed isolation: we now return the last committed state of rows with uncommitted changes instead of waiting for the updating transaction to terminate. This is similar to what Oracle does for read committed. Also, we now do log checkpoints without having to abort pending write transactions.
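
To make the difference concrete, here is a minimal sketch in C over ODBC; the DSN, credentials and the ACCOUNTS table are made up for illustration and error handling is omitted. One connection updates a row and leaves the transaction open; a second connection reading at read committed now gets the last committed value back right away instead of blocking until the first one commits or rolls back.

    #include <stdio.h>
    #include <sql.h>
    #include <sqlext.h>

    static SQLHDBC connect_one (SQLHENV henv)
    {
      SQLHDBC hdbc;
      SQLAllocHandle (SQL_HANDLE_DBC, henv, &hdbc);
      SQLConnect (hdbc, (SQLCHAR *) "dsn", SQL_NTS,
                  (SQLCHAR *) "user", SQL_NTS, (SQLCHAR *) "pwd", SQL_NTS);
      SQLSetConnectAttr (hdbc, SQL_ATTR_AUTOCOMMIT,
                         (SQLPOINTER) SQL_AUTOCOMMIT_OFF, 0);
      return hdbc;
    }

    int main (void)
    {
      SQLHENV henv;
      SQLHDBC writer, reader;
      SQLHSTMT ws, rs;
      SQLINTEGER balance;

      SQLAllocHandle (SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv);
      SQLSetEnvAttr (henv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) SQL_OV_ODBC3, 0);
      writer = connect_one (henv);
      reader = connect_one (henv);
      SQLSetConnectAttr (reader, SQL_ATTR_TXN_ISOLATION,
                         (SQLPOINTER) SQL_TXN_READ_COMMITTED, 0);

      /* the writer changes a row but does not commit yet */
      SQLAllocHandle (SQL_HANDLE_STMT, writer, &ws);
      SQLExecDirect (ws, (SQLCHAR *)
          "UPDATE ACCOUNTS SET BALANCE = BALANCE - 100 WHERE ACCT_ID = 1", SQL_NTS);

      /* the reader sees the previous committed balance immediately, no waiting */
      SQLAllocHandle (SQL_HANDLE_STMT, reader, &rs);
      SQLExecDirect (rs, (SQLCHAR *)
          "SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = 1", SQL_NTS);
      SQLBindCol (rs, 1, SQL_C_SLONG, &balance, 0, NULL);
      SQLFetch (rs);
      printf ("reader sees committed balance %d\n", (int) balance);

      SQLEndTran (SQL_HANDLE_DBC, writer, SQL_COMMIT);
      SQLEndTran (SQL_HANDLE_DBC, reader, SQL_COMMIT);
      return 0;
    }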

With faster inserts, we actually see the RDF bulk loader run slower. This is really backwards. The reason is that while one thread parses, other threads insert; when the inserting threads are done, they go to wait on a semaphore, and this whole business of context switching absolutely kills performance. With slower inserts, the parser keeps ahead, so there is less context switching and hence better overall throughput. I still do not get how the OS can spend between 1.5 and 6 microseconds, several thousand instructions, deciding what to do next when there are only 3-4 eligible threads and all the rest are background threads that get a few dozen time slices per second. Solaris is a little better than Linux at this, but not dramatically so. Mac OS X is way worse.
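
The kind of figure quoted above can be reproduced with a small ping-pong test: two threads alternately wait on and post a pair of POSIX semaphores, so every round trip forces two blocking context switches. This is only a rough sketch of the measurement, not the loader code itself; iteration counts are arbitrary, and on Linux it builds with -lpthread.

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <semaphore.h>
    #include <sys/time.h>

    #define ROUNDS 100000

    static sem_t ping, pong;

    static void *partner (void *arg)
    {
      int i;
      for (i = 0; i < ROUNDS; i++)
        {
          sem_wait (&ping);   /* block until the main thread hands over */
          sem_post (&pong);   /* wake the main thread again */
        }
      return NULL;
    }

    static double usec_now (void)
    {
      struct timeval tv;
      gettimeofday (&tv, NULL);
      return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main (void)
    {
      pthread_t thr;
      double start, end;
      int i;

      sem_init (&ping, 0, 0);
      sem_init (&pong, 0, 0);
      pthread_create (&thr, NULL, partner, NULL);

      start = usec_now ();
      for (i = 0; i < ROUNDS; i++)
        {
          sem_post (&ping);   /* hand over to the partner thread */
          sem_wait (&pong);   /* block until it has run */
        }
      end = usec_now ();
      pthread_join (&thr, NULL);

      /* each round involves two blocking waits, i.e. two switches */
      printf ("%.2f microseconds per switch\n", (end - start) / (2.0 * ROUNDS));
      return 0;
    }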

As said, we use Oracle 10g Release 2 on the same platform (Linux FC5, 64 bit) as a sparring partner. It is really a very good piece of software. We have written the TPC-C transactions in PL/SQL. What is surprising is that these procedures run amazingly slowly, even with a single client. Otherwise the Oracle engine is very fast. Well, as I recall, the official TPC-C runs with Oracle use an OCI client and no stored procedures. Strange. While Virtuoso, for example, fills the initial TPC-C state a little faster than Oracle, the procedures run 5-10 times slower on Oracle than on Virtuoso, all data in warm cache and with a single client. While some parts of Oracle are really well optimized (all the basic joins, aggregates and so on), we are surprised at how they could have neglected such a central piece as the procedure language.

Also, we have looked at transaction semantics. Serializable is mostly serializable with Oracle, but it does not always keep a steady count, and it does not prevent inserts into a range that a serializable transaction has found empty. True, it will not show these inserts to the serializable transaction, so in this it follows the rules. Also, to make a read really repeatable, it seems the read has to be FOR UPDATE; otherwise one cannot implement a reliable resource transaction, like changing the balance of an account.
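
For the account balance case, the pattern we mean is roughly the following, sketched in C over ODBC with a made-up ACCOUNTS table and no error handling: the read the update depends on is done FOR UPDATE, so the row cannot change between the read and the write.

    #include <stdio.h>
    #include <sql.h>
    #include <sqlext.h>

    int main (void)
    {
      SQLHENV henv;
      SQLHDBC hdbc;
      SQLHSTMT stmt;
      SQLINTEGER balance;
      char upd[128];

      SQLAllocHandle (SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv);
      SQLSetEnvAttr (henv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) SQL_OV_ODBC3, 0);
      SQLAllocHandle (SQL_HANDLE_DBC, henv, &hdbc);
      SQLConnect (hdbc, (SQLCHAR *) "dsn", SQL_NTS,
                  (SQLCHAR *) "user", SQL_NTS, (SQLCHAR *) "pwd", SQL_NTS);
      SQLSetConnectAttr (hdbc, SQL_ATTR_AUTOCOMMIT,
                         (SQLPOINTER) SQL_AUTOCOMMIT_OFF, 0);
      SQLAllocHandle (SQL_HANDLE_STMT, hdbc, &stmt);

      /* read the balance FOR UPDATE so no other transaction can change
         the row between this read and the update below */
      SQLExecDirect (stmt, (SQLCHAR *)
          "SELECT BALANCE FROM ACCOUNTS WHERE ACCT_ID = 1 FOR UPDATE", SQL_NTS);
      SQLBindCol (stmt, 1, SQL_C_SLONG, &balance, 0, NULL);
      SQLFetch (stmt);
      SQLCloseCursor (stmt);

      /* the new balance is computed from the value just read; without
         FOR UPDATE a concurrent transaction could change it in between */
      snprintf (upd, sizeof upd,
          "UPDATE ACCOUNTS SET BALANCE = %d WHERE ACCT_ID = 1", (int) balance - 100);
      SQLExecDirect (stmt, (SQLCHAR *) upd, SQL_NTS);

      SQLEndTran (SQL_HANDLE_DBC, hdbc, SQL_COMMIT);
      printf ("previous balance %d, debited 100\n", (int) balance);
      return 0;
    }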

Anyway, the Virtuoso engine overhaul is now mostly complete. This is of course an open-ended topic, but the present batch of work is nearing completion. We have gone through as many as 3 implementations of hash join; some things have yet to be finished there. Oracle has very good hash joins. The only way we could match that was to do it all in memory, dropping any persistent storage of the hash. This is of course OK if the hash is not very large, and anyway hash joins go sour if the hash does not fit in the working set.
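
The in-memory variant is conceptually nothing more than building a hash table on the smaller input and probing it with the larger one. The sketch below, with made-up rows and a fixed bucket count, shows the idea and leaves out everything a real engine needs for memory management and spilling.

    #include <stdio.h>
    #include <stdlib.h>

    #define N_BUCKETS 1024

    typedef struct build_row
    {
      int key;                    /* join key from the smaller (build) input */
      const char *payload;        /* whatever columns the join must output */
      struct build_row *next;     /* chain for keys hashing to the same bucket */
    } build_row_t;

    static build_row_t *buckets[N_BUCKETS];

    static unsigned hash_key (int key)
    {
      return ((unsigned) key * 2654435761u) % N_BUCKETS;
    }

    /* build phase: insert one row of the smaller input into the hash table */
    static void build_insert (int key, const char *payload)
    {
      build_row_t *row = malloc (sizeof (build_row_t));
      unsigned h = hash_key (key);
      row->key = key;
      row->payload = payload;
      row->next = buckets[h];
      buckets[h] = row;
    }

    /* probe phase: for one row of the larger input, emit all matches */
    static void probe (int key, const char *probe_payload)
    {
      build_row_t *row;
      for (row = buckets[hash_key (key)]; row; row = row->next)
        if (row->key == key)
          printf ("match: key=%d  %s  %s\n", key, row->payload, probe_payload);
    }

    int main (void)
    {
      /* stand-ins for the two join inputs */
      build_insert (1, "customer one");
      build_insert (2, "customer two");

      probe (2, "order A");        /* matches customer two */
      probe (1, "order B");        /* matches customer one */
      probe (7, "order C");        /* no match, emits nothing */
      return 0;
    }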

As next topics, we have more RDF work and the LUBM benchmark to finish. Also, we should revisit TPC-D.

Databases are really quite complicated and extensive pieces of software. Much more so than the casual observer might think.

01/10/2007 15:08 GMT Modified: 04/17/2008 21:04 GMT
Virtuoso TPCC and Multiprocessor Linux and Mac

We have updated our article on Virtuoso scalability with two new platforms: a 2 x dual-core Intel Xeon and a Mac Mini with an Intel Core Duo.

We have more than quadrupled the best result so far.

The best score so far is 83K transactions per minute with a 40-warehouse (about 4G) database. This is attributable to the process running mostly in memory, with 3 out of 4 cores busy on the database server. But even when doubling the database size and the number of clients, we stay at 49K transactions per minute, now with a little under 2 cores busy and an average of 20 disk reads pending at all times, split over 4 SATA disks. The measurement is the count of completed transactions during a 1h run. With the 80-warehouse database, it took about 18 minutes for the system to reach steady state with a warm working set, hence the actual steady rate is somewhat higher than 49K, as the warm-up period was included in the measurement.

The metric on the Mac Mini was 2.7K with 2G RAM and one disk. The CPU usage was about one third of one core. Since we have had rates of over 10K with 2G RAM, we attribute the low result to running on a single disk, which is not a very fast one at that.

We have run tests in 64- and 32-bit modes but have found little difference as long as the memory actually used does not exceed 4G. If anything, 32-bit binaries should have an advantage in cache hit rate, since most data structures take less space there. Once the process size exceeds the 32-bit limit, there is a notable difference in favor of 64 bit. Having more than 4G of database buffers produces a marked advantage over letting the OS use the space for file system cache. So, 64 bit is worthwhile, but only if there is enough memory. As for x86 having more registers in 64-bit mode, we have not specifically measured what effect that might have.

We also note that Linux has improved a great deal with respect to multiprocessor configurations. We use a very simple test with a number of threads acquiring and then immediately freeing the same mutex. On single-CPU systems, the real time has pretty much increased linearly with the number of threads. On multiprocessor systems, we used to get very non-linear behavior, with 2 threads competing for the same mutex taking tens of times the real time of a single thread. At last measurement, with a 64-bit FC 5, we saw 2 threads take 7x the real time when competing for the same mutex. This is in the same ballpark as Solaris 10 on a similar system. Mac OS X 10.4 Tiger on a 2x dual-core Xeon Mac Pro did the worst so far, with two threads taking over 70x the time of one. On a Mac Mini with a single Core Duo, the factor between one thread and two was 73.
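
The test itself is trivial; a minimal POSIX threads version looks roughly like this, with arbitrary iteration counts and the thread count taken from the command line. Running it with 1, 2 and 4 threads and comparing the real times gives the factors discussed above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <sys/time.h>

    #define ITERATIONS 1000000

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

    /* each thread locks and immediately unlocks the shared mutex in a loop */
    static void *worker (void *arg)
    {
      int i;
      for (i = 0; i < ITERATIONS; i++)
        {
          pthread_mutex_lock (&mtx);
          pthread_mutex_unlock (&mtx);
        }
      return NULL;
    }

    static double sec_now (void)
    {
      struct timeval tv;
      gettimeofday (&tv, NULL);
      return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main (int argc, char **argv)
    {
      int n_threads = argc > 1 ? atoi (argv[1]) : 2;
      pthread_t *threads = malloc (n_threads * sizeof (pthread_t));
      double start = sec_now ();
      int i;

      for (i = 0; i < n_threads; i++)
        pthread_create (&threads[i], NULL, worker, NULL);
      for (i = 0; i < n_threads; i++)
        pthread_join (threads[i], NULL);

      printf ("%d threads: %.3f s real time\n", n_threads, sec_now () - start);
      return 0;
    }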

Also, the proportion of system CPU time on Tiger was consistently higher than on Solaris or Linux when running the same benchmarks. Of course, for most applications this test is not significant, but it is relevant for database servers, as there are many very short critical sections involved in multithreaded processing of indices and the like.

09/25/2006 11:13 GMT Modified: 04/16/2008 16:53 GMT