(Cut & Pasted verbatim from Orri Erling's
Weblog.)
Virtuoso TPC-C
and Multiprocessor Linux and Mac:
We have updated our article on Virtuoso scalability with two new platforms: A
2 x dual core Intel Xeon and a Mac Mini with an Intel Core Duo.
We have more than quadrupled the best previous result.
The best score is now 83K transactions per minute with a 40
warehouse (about 4G) database. This is attributable to the process
running mostly in memory, with 3 out of 4 cores busy on the
database server. But even when doubling the database size and
number of clients, we stay at 49K transactions per minute, now
with a little under 2 cores busy and an average of 20 disk reads
pending at all times, split over 4 SATA disks. The measurement is
the count of completed transactions during a 1h run. With the 80
warehouse database, it took about 18 minutes for the system to
reach steady state, with a warm working set, hence the actual
steady rate is somewhat higher than 49K, as the warm up period was
included in the measurement.
The metric on the Mac Mini was 2.7K with 2G RAM and one disk.
The CPU usage was about one third of one core. Since we have had
rates of over 10K with 2G RAM, we attribute the low result to
running on a single, not particularly fast disk.
We have run tests in 64 and 32 bit modes but have found little
difference as long as actual memory use does not exceed 4G. If
anything, 32 bit binaries should have an advantage in cache hit
rate since most data structures take less space there. After the
process size exceeds the 32 bit limit, there is a notable
difference in favor of 64 bit. Having more than 4G of database
buffers produces a marked advantage over letting the OS use the
space for file system cache. So, 64 bit is worthwhile but only if
there is enough memory. As for x86 having more registers in 64 bit
mode, we have not specifically measured what effect that might
have.
We also note that Linux has improved a great deal with respect
to multiprocessor configurations. We use a very simple test with a
number of threads acquiring and then immediately freeing the same
mutex. On single CPU systems, the real time has pretty much
increased linearly with the number of threads. On multiprocessor
systems, we used to get very non-linear behavior, with 2 threads
competing for the same mutex taking tens of times the real time as
opposed to one thread. At last measurement, with a 64 bit FC 5, we
saw 2 threads take 7x the real time when competing for the same
mutex. This is in the same ballpark as Solaris 10 on a similar
system. Mac OS X 10.4 Tiger on a 2x dual core Xeon Mac Pro did the
worst so far, with two threads taking over 70x the time of one.
With a Mac Mini with a single Core Duo, the factor between one
thread and two was 73.
Also, the proportion of system CPU on Tiger was consistently
higher than on Solaris or Linux when running the same benchmarks.
Of course, for most applications this test is not significant, but it
is relevant for database servers, as there are many very short
critical sections involved in multithreaded processing of indices
and the like.